Abstract

In this paper we study iterative procedures for finding stationary equilibria
in games with a large number of players. Most learning algorithms for
games with continuous action spaces are limited to strict contraction
best-reply maps, for which the Banach-Picard iteration converges at a
geometric rate. When the best-reply map is not a contraction,
Ishikawa-based learning is proposed. The algorithm is shown to behave
well for Lipschitz continuous and pseudo-contractive maps. However, the
convergence rate is still unsatisfactory. Several acceleration techniques
are presented. We explain how cognitive users can improve the
convergence rate based on only a few measurements. The methodology
provides nice properties in mean-field games where the payoff function
depends only on the own-action and the mean of the mean field. A learning
framework that exploits the structure of such games, called mean-field
learning, is proposed. The proposed mean-field learning framework is
suitable not only for games but also for non-convex global optimization
problems. Then, we introduce mean-field learning without feedback and
examine the convergence to equilibria in beauty contest games, which
have interesting applications in financial markets. Finally, we provide a
fully distributed mean-field learning scheme and its speedup versions for
satisfactory solutions in wireless networks. We illustrate the convergence
rate improvement with numerical examples.



1 Introduction 



Recently there has been renewed interest in large-scale games in several
research disciplines, with applications in financial markets, biology, power
grids and cloud networking. Classical work provides rich mathematical
foundations and equilibrium concepts, but relatively little in the way of
learning, computational and representational insights that would allow game
theory to scale up to large-scale systems. The literature on learning in games
with discrete action spaces is huge (see [30] and the references therein).
However, only a few results are available for continuous action spaces. In this
paper, we explain how the rapidly emerging field of mean-field games can
address such behavioral, learning and algorithmic issues.

We propose both model-based (but still with less information) and
non-model-based learning schemes for games with continuous action space and a
large number of players. Each player updates her learning strategy based on
an aggregative term [18], which is the sum of the actions of the other players.
Each player is influenced by the aggregate, and the mean-field behavior is
formed from the contributions of each player. In the model-based mean-field
learning scheme, the mean action is read by the players at each time slot, and
each player responds to the aggregative term locally. This drastically reduces
the dimensionality of the best-response system in the asymptotic case.

We distinguish different types of learning schemes depending on the infor- 
mation requirement: 

(i) Partially distributed strategic learning [30], where each player knows 
her own-payoff function, has some computational capabilities and observes the 
actions of the other players at the previous step. Examples of such learning 
schemes include best response algorithms, fictitious play, and logit algorithms. 

(ii) Fully distributed strategic learning: In many dynamic interactions, one
would like to have a learning and adaptive procedure which does not require
any information about the other players' actions or payoffs and uses as little
memory as possible (a small number of parameters in terms of past own-actions
and past own-payoffs). Fully distributed learning algorithms are based only on
numerical measurements of signals or payoffs. The mathematical structure of the
own-payoff function is not assumed to be known by the player. Hence,
gradient-like ascent and best-reply algorithms cannot be used directly. The
observations of private signals/measurements are not explicit in the actions of
the other players. Based on numerical measurements of the realized own-payoff,
each player employs a certain learning pattern in order to learn the expected
payoff function (payoff-learning) as well as the associated optimal strategies
(model-free strategy-learning). This type of learning algorithm is referred to
as Combined fully Distributed PAyoff and Strategy learning (CODIPAS, [29]).
These algorithms are simple but they play an important role in terms of
applications since they are based on experiments and real data. In the
continuous action space case, the gradient of the own-payoff is not observed
and hence it needs to be estimated or learned if one wants to use a
gradient-like ascent method. However, estimating an accurate gradient based
only on a sequence of payoff measurements is not a trivial task.

(iii) No-feedback learning where the players do not observe any numerical 
payoff measurement. These schemes are based only on estimations and offline 
adjustment. However, conjectures and hierarchical reasoning could be used in 
order to get consistent reactions. 

In all three categories of learning algorithms above, the combination
of the learning patterns of all the players forms a multidimensional interactive
system. The question we address is whether it is possible to exploit the structure
of the payoff functions in the large-scale regime to reduce the complexity of the
above learning algorithms (partially or fully distributed).

The answer to this question is positive for some aggregative games. We
examine three classes of mean-field learning frameworks:

• Partially distributed mean-field learning where each player knows her own- 
payoff function, has some computational capabilities and observes the 
mean field at the previous step. Examples of such learning algorithms 
include best response to mean field and Boltzmann-Gibbs mean-field re- 
sponse. 

• Fully distributed mean-field learning schemes can be used in situations 
where each generic player in the large population is able to observe/measure 
a numerical value of her own-payoff (that could be noisy). These schemes 
are derivative-free and model-free. They can be applied in both mean-field 
games and mean-field global optimization. 

• No-feedback mean-field learning: There are some situations where it is
difficult to feed back any information to the population of players and
local measurement of own-payoff is not available. Then, the above two
classes of learning algorithms, which are based on feedback to the players,
are inappropriate. In that case, a learning scheme without feedback can
be employed if the payoff functions are common knowledge.

1.1 Overview: learning for games with continuous action 
space 

We briefly overview fully distributed learning for games with continuous action
space. One of the first fully distributed learning algorithms is the so-called
reinforcement learning. While there are promising results in Markov decision
processes with a small number of states and actions, the majority of
reinforcement Q-learning, adaptive heuristic critic, and regret-minimizing
multi-armed bandit algorithms meet several difficulties in continuous action
space. The difficulty in extending such learning algorithms to multi-player
games is that a balance has to be maintained between exploiting the information
gained during learning and exploring the set of actions (which is a continuum)
to gain more information. Instead of updating a finite-dimensional probability
vector, one has to adjust a probability density function in an
infinite-dimensional space. The authors in [32] have proposed reinforcement
learning algorithms for games with continuous and compact action space applied
to vehicle suspension control. The convergence analysis is not conducted in [32].

In [33] the authors studied continuous action reinforcement learning
automata and applied them to adaptive digital filters. The authors claimed
convergence of their algorithm via computer simulations.

Recently, [34] observed a poor performance and selection of basis functions
that are used in [32, 33] to approximate the infinite-dimensional space. It is
conjectured that the convergence time (when it converges) is very high even for
a quadratic cost function.

In order to reduce the dimensionality, the basic idea in these continuous
action space reinforcement learning studies has been to use a normal distribution
at each time slot and update its mean and standard deviation based on the
payoff measurement.

Another approach to continuous action space learning is the
regret-minimizing procedure, which allows play to be close to the Hannan set.
Such procedures have been widely studied for discrete action spaces and have
several interesting computational properties: the set of correlated equilibria
is convex and there is a polynomial-time algorithm. The extension to continuous
and compact action space has been conducted in [35]. It is shown that the
empirical frequencies of play converge to the set of correlated equilibria.
However, most of these convergence results are not for a point but for a set,
and the convergence time is not provided in [35]. Another important point is
that the convergence of the frequency of play does not imply the convergence of
actions or strategies. All the above references consider a finite number of
players.

In this work we are interested in learning in games with a large number of
players and continuous action space. The framework presented here differs
from classical machine learning for large-scale systems. The main difference
is the strategic behavior of the players, who make decisions in a distributed and
autonomous manner. This creates an interdependency between the decisions
through the mean of the mean field.

1.2 Contribution 

Our contributions can be summarized as follows. First, we introduce a learning
framework for games with a large population of players, called mean-field
learning. Considering payoff functions that depend on own-input and an
aggregative term, we show that mean-field learning drastically simplifies the
analysis of large-scale interactive learning systems. In the single class case,
it reduces to the analysis of one iterative process instead of an infinite
collection of learning processes. Second, we study both asymptotic and
non-asymptotic properties of the resulting learning framework. Stability, error
bounds and acceleration techniques are proposed for model-based mean-field
learning as well as for derivative-free mean-field learning. In particular, we
show that the convergence time of $(o + 1)$-order speedup learning is at most



where $\eta^*$ is the error target, $c_2$ is a positive value which does not depend on
time and $\eta_0$ is the initial error gap. Interestingly, the methodology extends to
satisfactory solutions (situations where all the players are satisfied). We
provide fully distributed mean-field learning schemes for satisfactory solutions in
specific games. Reverse Ishikawa and Steffensen speedup learning are proposed
to improve the convergence rate to satisfactory solutions. Numerical examples
of the basic learning techniques are illustrated and compared in guessing games
and in quality-of-service (QoS) satisfaction problems in wireless networks.

1.3 Structure 

The paper is organized as follows. In the next section we present a generic
mean-field game model. In Section 3 we present the mean-field learning framework.
In Section 4 we present a detailed example of a beauty contest mean-field game.
Speedup strategic learning schemes for satisfactory solutions are presented in
Section 5. Finally, Section 6 concludes the paper.
The proofs are given in the Appendix.

2 Mean field game model 

Consider $n \ge 2$ players. Each player takes her action in the convex set
$\mathcal{A} \subset \mathbb{R}^d$, $d \ge 1$. Denote the set of players by
$\mathcal{N} = \{1, 2, \ldots, n\}$. In the standard
formulation of a game, a player's payoff function depends on opponents' individual
actions. Yet, in many games, payoff functions depend only on some aggregate of 
these, an example being the Cournot model where it is the aggregate supply of 
opponents that matters rather than their individual strategies. The main prop- 
erty of aggregative games is the structure of the payoff functions. The payoff 
function of each player depends on its own action and an aggregative term of 
the other actions. A generic payoff function in aggregative game with additive 
aggregative term is given by $r_j : \mathcal{A}^n \to \mathbb{R}$,

$r_j(a_j, a_{-j}) = \bar{r}_j\left(a_j, \frac{1}{n}\sum_{j'=1}^{n} a_{j'}\right), \qquad (1)$

where the actions are real numbers and $\bar{r}_j : \mathcal{A}^2 \to \mathbb{R}$. In this context the key
term of player $j$ is the structure of the function $\bar{r}_j$, its own-action $a_j$ and the
aggregative term $\frac{1}{n}\sum_{j'=1}^{n} a_{j'}$.

The triplet $\mathcal{G} := (\mathcal{N}, \mathcal{A}, (r_j)_{j \in \mathcal{N}})$ constitutes a one-shot game in strategic
form.

Applications 1 The type of aggregate-driven reward function in (1) has a wide
range of applications:

(a) In economics and financial markets, the market price (of products, goods,
phones, laptops, etc.) is influenced by the total demand and total supply.

(b) In queueing theory, the task completion of a data center or a server is
influenced by the mean of how much the other data centers/servers can serve.
(c) In resource sharing problems, the utility/disutility of a player depends on
the demand of the other players. Examples include cost sharing in coalitional
systems and capacity and bandwidth sharing in cloud networking.

(d) In wireless networks, the performance of a wireless node is influenced by 
the interference created by the other transmitters. 

(e) In congestion control, the delay of a network depends on the aggregate 
(total) flow and the congestion level of the links/routes. 

Definition 1 The action profile $(a_j)_{j \in \mathcal{N}}$ is a pure equilibrium of the game $\mathcal{G}$ if
no player can improve her payoff by unilateral deviation, i.e., for every player
$j \in \mathcal{N}$, one has

$r_j(a) \ge r_j(a'_j, a_{-j}), \quad \forall a'_j \in \mathcal{A}.$

Before seeking a pure equilibrium we first need to ask whether the problem is
well-posed, i.e., whether a pure equilibrium exists. Below we provide a classical
sufficient condition for the existence of a pure equilibrium in continuous-kernel
aggregative games.

The following results hold:

• Compactness: If $\mathcal{A}$ is a non-empty, compact, convex subset of $\mathbb{R}^d$, and each
$r_j$ is (lower semi-) continuous in $a$ and quasi-concave with respect
to the first variable, then $\mathcal{G}$ possesses at least one pure strategy Nash
equilibrium.

• If $\mathcal{A}$ is non-compact, we require an additional coercivity assumption:
$r_j(a_j, a_{-j}) \to -\infty$ as $\|a_j\| \to +\infty$.

For discontinuous payoff functions $r_j$ we refer to the recent developments on
the existence of pure equilibria. A very active area is the full characterization
of the existence of pure strategy Nash equilibria in games with general
topological strategy spaces that may be discrete, continuum or non-convex and
payoff functions that may be discontinuous or lack any form of quasi-convexity.
For more details, see the literature review in Tian (2009, [13]).

Below we examine only the cases in which the game admits at least one
equilibrium and the best response is uniquely given by the mapping $f$ (which
is not necessarily continuous). We present aggregate-based learning algorithms.
Generically, a partially distributed mean-field-based learning scheme can be
written as $a_{j,t+1} = F_j(a_{j,t}, \ldots, a_{j,0}, \bar{m}_{n,t}, \ldots, \bar{m}_{n,0})$, where $F_j$ is specified from
the game model and $\bar{m}_{n,t}$ is a mean action of the players at time $t$. In this
paper we examine only schemes with one-step memory of the form
$a_{j,t+1} = F_j(a_{j,t}, \bar{m}_{n,t})$.

2.1 Aggregate-based best-response 

Let $\bar{m}_{j,n} = \frac{1}{n-1}\sum_{j' \neq j} a_{j'}$ be the mean of the actions of the others. We assume that
$\arg\max_{a'_j} r_j(a'_j, \bar{m}_{j,n})$ has a unique element, which we denote by $f_j(a_{-j})$. Then,
$f(a) = (f_j(a_{-j}))_j$ is the best-response map. We aim to find a fixed-point of such
a map in $\mathcal{A}^n$, which lies in a Euclidean space of dimension $nd$. Exploiting the
aggregative structure of the game and the convexity of the set $\mathcal{A}$, the domain
of the function $f_j$ is reduced to the set $\mathcal{A}$ by the relation $f_j(a_{-j}) = f_j(\bar{m}_{j,n})$,
where $\bar{m}_{j,n} \in \mathcal{A}$ by convexity of the domain $\mathcal{A}$. The simultaneous
best-response algorithm, called mean-field response, is given in Algorithm 1, which
requires that player $j$ observes the common term $\bar{m}_{n,t-1} = \frac{1}{n}\sum_{j'=1}^{n} a_{j',t-1}$ at
the previous step and computes

$\bar{m}_{j,n,t-1} = \frac{1}{n-1}\left(n\bar{m}_{n,t-1} - a_{j,t-1}\right).$



Algorithm 1 : Mean-field response
1: Initialization:
   for each user $j \in \mathcal{N}$ initialize $a_{j,0}$
2: Learning pattern:
   for each time slot $t$
      for each user $j \in \mathcal{N}$ do
         Observe the aggregate $\bar{m}_{n,t-1}$
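The mean-field response pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the best-response map `f` below is a hypothetical strict contraction chosen for the example (it does not come from the paper), and every player is assumed to respond to the mean of the others' actions recovered from the common aggregate.

```python
def mean_field_response(best_response, actions, steps):
    """Simultaneous best response to the mean of the other players' actions."""
    n = len(actions)
    a = list(actions)
    for _ in range(steps):
        mean_all = sum(a) / n  # common aggregate observed by every player
        # each player removes her own action to recover the mean of the others
        others = [(n * mean_all - aj) / (n - 1) for aj in a]
        a = [best_response(m) for m in others]  # simultaneous update
    return a


def f(m):
    # Hypothetical best-response map (assumption): a contraction with fixed point 2.
    return 0.5 * m + 1.0


final = mean_field_response(f, [0.0, 1.0, 5.0, 3.0], steps=60)
print(final)
```

Each player needs only the aggregate and her own previous action, which is the information requirement described above.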



2.2 Banach-Picard learning algorithm 

One of the first basic iterative procedures for finding a fixed-point of a
continuous map $f$ over a complete metric space $\mathcal{A}$ is the Banach-Picard iterate. The
algorithm consists of starting at some point $a_0 \in \mathcal{A}$ and taking the compositions
$f(a_0), f(f(a_0)), \ldots$, that is $a_{t+1} = f(a_t)$, where $f : \mathcal{A} \to \mathcal{A}$. This algorithm
is known to be convergent for strict contraction maps, i.e., if there exists a
Lipschitz constant $0 < L < 1$ of the function $f$ then the iterates converge (with
geometric convergence rate $L$) to the unique fixed-point of $f$ in $\mathcal{A}$. However,
in many applications of interest the function $f$ may not be a contraction. See
Example 1 below.

Theorem 1 ([5]) Let $(\mathcal{A}, d)$ be a complete metric space, and $f : \mathcal{A} \to \mathcal{A}$ a
map for which there exist real numbers $\alpha_1$, $\alpha_2$ and $\alpha_3$ satisfying $0 \le \alpha_1 < 1$,
$0 \le \alpha_2, \alpha_3 < 1/2$ such that for each pair $a_1, a_2$ in $\mathcal{A}$, at least one of the following
conditions is true:

(C0) $d(f(a_1), f(a_2)) \le \alpha_1 d(a_1, a_2)$;

(C1) $d(f(a_1), f(a_2)) \le \alpha_2 [d(a_1, f(a_1)) + d(a_2, f(a_2))]$;

(C2) $d(f(a_1), f(a_2)) \le \alpha_3 [d(a_1, f(a_2)) + d(a_2, f(a_1))]$.

Then, the Banach-Picard algorithm converges to a fixed-point of $f$ for any initial
point $a_0 \in \mathcal{A}$. Moreover, $f$ has a unique fixed-point.
Note that the first condition of Theorem 1, (C0), has many applications in solving
nonlinear equations, but suffers from one drawback: the contractive condition
(C0) forces $f$ to be continuous on $\mathcal{A}$. However, the second condition (C1) does
not require continuity of the best-response function $f$. The condition (C1) has
been studied by Kannan (1968) and (C2) by Chatterjea (1972, [11]).

Example 1 (Resource sharing game) Consider $n$ players in a network with
a resource capacity $c_n > 0$. Each player has a certain demand $a_j \ge 0$, which
corresponds to her action. The action space is $\mathbb{R}_+$. Denote by $p_n > 0$ the unit
price for each resource utilization. The payoff function is



where $\epsilon_n$ is a positive parameter. Then, the (simultaneous) best-response
algorithm is given by

$a_{t+1} = \sqrt{\frac{c_n}{p_n}\left(\epsilon_n + (n-1)a_t\right)} - \left(\epsilon_n + (n-1)a_t\right)$

and

$f(a) = \sqrt{\frac{c_n}{p_n}\left(\epsilon_n + (n-1)a\right)} - \left(\epsilon_n + (n-1)a\right).$

A direct computation of the derivative (at the interior) gives

$f'(a) = -(n-1) + \frac{(n-1)c_n}{2p_n}\left(\frac{c_n}{p_n}\left(\epsilon_n + (n-1)a\right)\right)^{-1/2}.$

Clearly, $f$ is not a contraction.

Example 2 A non-convergent Banach-Picard iteration is obtained for $\mathcal{A} = [\frac{1}{4}, 4]$
and $f(a) = 1/a$. The map $f$ is 16-Lipschitz on the domain $\mathcal{A}$. Start with $a_0 \neq 1$;
then $a_{2t} = a_0$ and $a_{2t+1} = \frac{1}{a_0} \neq a_0$. The two subsequences $a_{2t}$ and $a_{2t+1}$, $t \ge 0$,
have different limits. Hence, the sequence $a_{t+1} = f(a_t)$ does not converge if the
starting point is $a_0 \neq 1$.



We explain below how to design a convergent sequence for the problems of
Examples 1 and 2. A simple modification of Banach-Picard consists of introducing
a learning rate $\lambda$ which takes the average between the previous action and the
best response, $a_{t+1} = \lambda f(a_t) + (1 - \lambda) a_t$. Then, the procedure evolves slowly.
The idea goes back at least to Mann (1953, [8]) and Krasnoselskij (1955, [7]).
For $\lambda$ equal to one, one gets the Banach-Picard algorithm.
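The effect of the learning rate can be checked on Example 2. The sketch below assumes the starting point $a_0 = 2$ and the rate $\lambda = 0.1$, both chosen only for illustration; `averaged_iteration` is a hypothetical helper name.

```python
def f(a):
    return 1.0 / a  # best-response map of Example 2 on A = [1/4, 4]


def averaged_iteration(f, a0, lam, steps):
    """Iterate a_{t+1} = lam * f(a_t) + (1 - lam) * a_t; lam = 1 is Banach-Picard."""
    a = a0
    for _ in range(steps):
        a = lam * f(a) + (1.0 - lam) * a
    return a


picard_even = averaged_iteration(f, 2.0, 1.0, 100)  # back at a_0
picard_odd = averaged_iteration(f, 2.0, 1.0, 101)   # at 1 / a_0
mann = averaged_iteration(f, 2.0, 0.1, 200)         # averaged scheme
print(picard_even, picard_odd, mann)
```

With $\lambda = 1$ the iterates keep alternating between $a_0$ and $1/a_0$, while the averaged map $a \mapsto 0.9a + 0.1/a$ stays in $[1/4, 4]$ and approaches the fixed point 1.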

If the function $f$ is a Lipschitz and strongly pseudo-contractive map, i.e., it satisfies

(C3) $\|a_1 - a_2\| \le \|a_1 - a_2 + s[a_1 - f(a_1) - k a_1 - (a_2 - f(a_2) - k a_2)]\|$

for any pair $(a_1, a_2) \in \mathcal{A}^2$, where $s > 0$, $k > 0$, then there is $\bar{\lambda}$ such that for any
$\lambda$, $0 < \lambda < \bar{\lambda}$, the learning algorithm converges to a fixed-point. However, for $\lambda$
closer to one, the algorithm may oscillate around a fixed-point. In order to get
a smaller $\lambda$ in the long run, we can take a decreasing learning rate
$\lambda_t \to 0$ as $t$ grows, with $\sum_{t \ge 1} \lambda_t = +\infty$.

The algorithm (also referred to as Mann's algorithm [8]) reads

$a_{t+1} = \lambda_t f(a_t) + (1 - \lambda_t) a_t, \qquad (2)$
$0 \le \lambda_t \le 1, \quad a_0 \in \mathcal{A}. \qquad (3)$



Algorithm 2 : Mann-based mean-field response
1: Initialization:
   for each user $j \in \mathcal{N}$ initialize $a_{j,0}$
   Define the sequence up to $T$: $\lambda_{j,t}$ for $t \in \{1, 2, \ldots, T\}$
2: Learning pattern:
   for each time slot $t$
      for each user $j \in \mathcal{N}$ do:
         Observe the aggregate $\bar{m}_{n,t}$
         $a_{j,t+1} = \lambda_{j,t} f_j\left[\frac{1}{n-1}\left(n\bar{m}_{n,t-1} - a_{j,t-1}\right)\right] + (1 - \lambda_{j,t}) a_{j,t}$



If the function $f$ does not satisfy (C0), one can still get some convergence
results due to Ishikawa [10]:

$a_{t+1} = \lambda_t f\left(\mu_t f(a_t) + (1 - \mu_t) a_t\right) + (1 - \lambda_t) a_t, \qquad (4)$
$0 \le \lambda_t \le 1, \quad 0 \le \mu_t \le 1, \qquad (5)$
$a_0 \in \mathcal{A}. \qquad (6)$

Clearly, the same technique can be extended to a finite number of
compositions of the mapping $f$.

Algorithm 3 : Ishikawa-based mean-field response
1: Initialization:
   for each user $j \in \mathcal{N}$ initialize $a_{j,0}$
   Define the sequences up to $T$: $\lambda_{j,t}$, $\mu_{j,t}$ for $t \in \{1, 2, \ldots, T\}$
2: Learning pattern:
   for each time slot $t$
      for each user $j \in \mathcal{N}$ do:
         Observe the aggregate $\bar{m}_{n,t}$
         $a_{j,t+1} = \lambda_{j,t} f_j(y_{j,t}) + (1 - \lambda_{j,t}) a_{j,t}$, where
         $y_{j,t} = \mu_{j,t} f_j\left(\frac{1}{n-1}\left(n\bar{m}_{n,t-1} - a_{j,t-1}\right)\right) + (1 - \mu_{j,t}) a_{j,t}$



Figure 1 illustrates a cycling behavior of the mean-field response of Example
1. The parameters are $\mu_{j,t} = 0$, $\lambda_t = 0.9$, $n = 10$, $\epsilon_n = 0$, $c_n = 1$, $p_n = 1$, $a_0 = 0.005$.
For $\lambda = 0.1$, smaller than the value used in Figure 1, the cycle disappears and
the Ishikawa-based mean-field response scheme behaves well. See Figure 2.
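The two regimes can be reproduced with a short sketch of the symmetric iteration. The parameter values are the ones quoted in the text; clipping the best response at zero (so that actions stay in the nonnegative action space) is an assumption of this sketch, as is the symmetric form $f(a) = \sqrt{(c_n/p_n)(\epsilon_n + (n-1)a)} - (\epsilon_n + (n-1)a)$.

```python
import math

N, EPS, C, P = 10, 0.0, 1.0, 1.0  # n, epsilon_n, c_n, p_n from the text


def best_response(a):
    """Symmetric best response of Example 1, clipped at 0 (clipping is an assumption)."""
    x = EPS + (N - 1) * a
    return max(0.0, math.sqrt(C / P * x) - x)


def ishikawa(f, a0, lam, mu, steps):
    """Return the trajectory of the Ishikawa scheme; mu = 0 gives Mann's scheme."""
    a = a0
    traj = [a]
    for _ in range(steps):
        y = mu * f(a) + (1.0 - mu) * a    # inner averaging step
        a = lam * f(y) + (1.0 - lam) * a  # outer averaging step
        traj.append(a)
    return traj


big = ishikawa(best_response, 0.005, lam=0.9, mu=0.0, steps=200)
small = ishikawa(best_response, 0.005, lam=0.1, mu=0.0, steps=200)
print(big[-2:], small[-1])
```

Under these assumptions the run with $\lambda = 0.9$ settles on a two-point cycle, whereas the run with $\lambda = 0.1$ approaches the interior fixed point $a = 0.09$ (the solution of $10a = 3\sqrt{a}$).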
[Figure 1 plot; axis: Action]

Figure 1: Mann-based mean-field response for Example 1 with big learning rate:
presence of limit cycle.

2.3 Faster algorithms: Banach-Picard vs Ishikawa 

For a class of functions satisfying one of the conditions (C0)-(C2), we know from
Theorem 1 that there is a unique fixed-point of $f$, and the speed of convergence
of the algorithm can be compared for different parameters $\mu_t$. Suppose that $a_t$
and $b_t$ are two real convergent sequences with limits $a^*$ and $b^*$ respectively.
Then $\{a_t\}_t$ is said to converge faster than $\{b_t\}_t$ if $\lim_{t} \frac{|a_t - a^*|}{|b_t - b^*|} = 0$.

The authors in [6, 5] showed that, for a particular class of functions that
satisfies one of the conditions (C0), (C1) or (C2), the Banach-Picard algorithm is
faster than Ishikawa's algorithm with $\mu_t = 0$, which is in turn faster than the one with
$\mu_t > 0$. However, in general these algorithms are not comparable due to
non-convergence. For example, for $f(a) = 1/a$, $\mathcal{A} = [1/4, 4]$, there is a unique fixed
point but Banach-Picard does not converge starting from $a_0 \neq 1$. However, the
Ishikawa method converges to 1 for $\lambda$ small enough.

2.4 Reverse Ishikawa mean-field learning 

The reverse Ishikawa mean-field learning consists of choosing a learning rate
(bigger than one) that converges to one:

$a_{t+1} = \lambda_t f(a_t) + (1 - \lambda_t) a_t, \qquad (7)$
$1 < \lambda_t < 2, \quad \lim_t \lambda_t = 1, \quad a_0 \in \mathcal{A}. \qquad (8)$

An example of $\lambda_t$ is $1 + \frac{1}{t+1}$. When convergent, this scheme has the advantage
of being faster than the fixed-point iteration.
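A small numerical comparison illustrates the claim. The map `f` and the rate $\lambda_t = 1 + 1/(t+2)$, $t = 0, 1, \ldots$, are hypothetical choices made for this sketch; the rate satisfies $1 < \lambda_t < 2$ and tends to one.

```python
def f(m):
    # Hypothetical contraction (assumption) with fixed point 2.
    return 0.5 * m + 1.0


def relaxed_iteration(f, a0, rate, steps):
    """Iterate a_{t+1} = lam_t * f(a_t) + (1 - lam_t) * a_t with lam_t = rate(t)."""
    a = a0
    for t in range(steps):
        lam = rate(t)
        a = lam * f(a) + (1.0 - lam) * a
    return a


picard = relaxed_iteration(f, 0.0, lambda t: 1.0, 10)
reverse = relaxed_iteration(f, 0.0, lambda t: 1.0 + 1.0 / (t + 2), 10)
print(abs(picard - 2.0), abs(reverse - 2.0))
```

For this map the error is multiplied by $1 - \lambda_t/2$ at each step, which is below the Banach-Picard factor $1/2$ whenever $1 < \lambda_t < 2$. For maps with a negative slope at the fixed point, over-relaxation can instead slow down or destroy convergence, which is why the text restricts the claim to the convergent case.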
[Figure 2 plot; axis: Action]

Figure 2: Ishikawa-based mean-field response for Example 1 with small learning
rate.

2.5 Non-asymptotic properties 

We focus on the non-asymptotic properties of the Ishikawa-based learning
algorithms. Non-asymptotic strategic learning is very important in engineering
applications. Traditional results on the fundamental limits of data compression
and data transmission through noisy channels apply to the asymptotic regime
as the codelength or blocklength goes to infinity. However, these asymptotic results
are not usable if the window size and horizon are bounded. Therefore, it is
interesting to look at the non-asymptotic regime of learning algorithms. We
provide generic rates of convergence for some classes of best-response functions.

2.5.1 Strict contraction 

For a strict contraction mapping $f$ with constant $\alpha_1 = L < 1$, one has the
following estimate:

$d(a_t, a^*) \le \frac{\alpha_1^t}{1 - \alpha_1}\, d(a_0, a^*), \quad t \ge 0.$

The advantage of this inequality is that it provides an error bound at any time,
which is a non-asymptotic learning result.
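The estimate $d(a_t, a^*) \le \frac{\alpha_1^t}{1 - \alpha_1} d(a_0, a^*)$ can be turned into a sufficient iteration count for a target error $\eta$: it is enough that $t \ge \ln(\eta(1 - \alpha_1)/d(a_0, a^*)) / \ln \alpha_1$. The sketch below applies this to a hypothetical contraction; the map, the starting point and the target are assumptions made for illustration.

```python
import math


def iterations_needed(alpha, d0, eta):
    """Iteration count sufficient for alpha**t / (1 - alpha) * d0 <= eta."""
    T = math.log(eta * (1.0 - alpha) / d0) / math.log(alpha)
    return 1 + math.floor(max(0.0, T))


def f(m):
    # Hypothetical contraction (assumption): alpha = 0.5, fixed point 2.
    return 0.5 * m + 1.0


alpha, m0, m_star, eta = 0.5, 0.0, 2.0, 1e-3
T_eta = iterations_needed(alpha, abs(m0 - m_star), eta)
m = m0
for _ in range(T_eta):
    m = f(m)
print(T_eta, abs(m - m_star))
```

The count is conservative: it comes from an upper bound, so the target may be reached earlier in practice.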

2.5.2 Nonexpansive best-response function 

The next result provides the convergence rate of the asymptotic regularity for
nonexpansive maps, i.e., the class of maps with Lipschitz constant $\alpha_1 = 1$. Denote
by diameter($\mathcal{A}$) the diameter of the set $\mathcal{A}$.

Theorem 2 ([4]) Let $f : \mathcal{A} \to \mathcal{A}$ be a nonexpansive self-mapping of a bounded
convex subset $\mathcal{A}$ of a normed linear space, so that diameter($\mathcal{A}$) $< +\infty$,
with non-empty fixed-point set fix($f$). If the Ishikawa algorithm with $\mu_t = 0$ and
$\lambda_t = \lambda \in (0, 1)$ is a proper convex combination, then

$d(a_t, f(a_t)) \to 0.$

Moreover,

$d(a_t, f(a_t)) \le \frac{\mathrm{diameter}(\mathcal{A})}{\sqrt{\pi \sum_{t'=1}^{t} \lambda_{t'}(1 - \lambda_{t'})}}.$

If $\mathcal{A}$ is unbounded, the following estimate holds:

$d(a_t, f(a_t)) \le \frac{2\, d(a_0, \mathrm{fix}(f))}{\sqrt{\pi \sum_{t'=1}^{t} \lambda_{t'}(1 - \lambda_{t'})}}.$

We observe that when $\sum_{t'=1}^{t} \lambda_{t'}(1 - \lambda_{t'}) \to +\infty$ one gets the so-called
asymptotic regularity of the sequence generated by the Ishikawa algorithm.



2.5.3 Strongly pseudocontractive 

Theorem 3 ([14]) Let $f$ be Lipschitz with constant $L$ and strongly
pseudocontractive with constant $k$ such that fix($f$) is non-empty. Then the Ishikawa
algorithm with $\lambda_t = \lambda \in (0, \bar{\lambda})$, $\bar{\lambda} = \frac{k}{(L+1)(L+2-k)}$, $\mu_t = 0$ converges strongly
to the (unique) fixed point of $f$. The convergence rate is geometric and is
given by

$d(a_t, \mathrm{fix}(f)) \le d(a_0, \mathrm{fix}(f))\, \rho(\lambda)^t,$

where

$\rho(\lambda) = \frac{1 + (1-k)\lambda + (L+1)(L+2-k)\lambda^2}{1 + \lambda},$

which is minimized for $\lambda^* = -1 + \sqrt{1 + \bar{\lambda}}$.
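The constants of Theorem 3 are easy to evaluate numerically. The sketch below assumes $\bar{\lambda} = k / ((L+1)(L+2-k))$, which is the value at which $\rho(\bar{\lambda}) = 1$, and uses the hypothetical map $f(a) = -2a$ (Lipschitz with $L = 2$, and satisfying (C3) with $k = 0.5$) purely as an illustration.

```python
import math


def rate_constants(L, k):
    """Return (lam_bar, lam_star) under the assumed threshold formula."""
    c = (L + 1.0) * (L + 2.0 - k)
    lam_bar = k / c                               # rho(lam_bar) equals 1
    lam_star = -1.0 + math.sqrt(1.0 + lam_bar)    # minimizer of rho
    return lam_bar, lam_star


def rho(lam, L, k):
    c = (L + 1.0) * (L + 2.0 - k)
    return (1.0 + (1.0 - k) * lam + c * lam * lam) / (1.0 + lam)


def f(a):
    return -2.0 * a  # hypothetical map (assumption), fixed point 0


L, k = 2.0, 0.5
lam_bar, lam_star = rate_constants(L, k)
a0, steps = 1.0, 50
a = a0
for _ in range(steps):
    a = lam_star * f(a) + (1.0 - lam_star) * a  # Ishikawa step with mu_t = 0
bound = abs(a0) * rho(lam_star, L, k) ** steps
print(lam_bar, lam_star, abs(a), bound)
```

On this example the observed error is well below the theoretical bound, which is expected since $\rho(\lambda)$ is a worst-case rate over the whole class of maps.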

2.6 Asymptotic pseudo-trajectories 

Using classical approximation, the asymptotic pseudo-trajectories of Ishikawa's
algorithm can be studied using ordinary differential equations (ODEs). We
assume that the function $f$ has a unique integral curve, in order to guarantee
the uniqueness of the solution of the ODE (given a starting point).

2.6.1 $\mu_t = 0$

The ODE is given by

$\dot{a}_j = \frac{d}{dt} a_j = f_j(\bar{m}_{j,n}) - a_j,$
which is the aggregate-response dynamics in continuous time.
If $f_j$ denotes a best response to $\bar{m}_{j,n}$ then the steady states of the
aggregate-response dynamics are Nash equilibria of the aggregative game. If the
dynamics has a Lyapunov function then one gets global convergence to Nash
equilibria. These games are called Lyapunov games.

Now we turn to the convergence time for a given error/precision. We define
the convergence time of $a_t$ within an $\eta$-neighborhood as the first time the
trajectory of pure strategy reaches a neighborhood of range $\eta$ of the set of
fixed-points:

$T_\eta = \inf\{t \ge 0 \mid d(a_t, \mathrm{fix}(f)) \le \eta\}.$

2.6.2 How to accelerate the convergence time of the ODE? 

Consider the ordinary differential equations (ODEs) that capture the
trajectories of the learning pattern in the transient phase: $\dot{a}_t = f(a_t)$ and $\dot{b}_t = \lambda_t f(b_t)$.
Assume that the two ODEs start at the same point $a_0 = b_0$. What can we say
about the trajectories of $a$ and $b$?

We explicitly give the convergence time of $b$ as a function of that of $a$.

Proposition 1 The explicit solution is given by

$b_t = a_{\int_0^t \lambda_s\, ds}.$

In particular, if the trajectory of $a$ reaches a target set $O$ in at most $T_a$ time
units then the trajectory of $b$ reaches the same set in at most $T_b = g^{-1}(T_a)$,
where $g : t \mapsto \int_0^t \lambda_s\, ds$.
If $\lambda_s = \lambda$, then $T_b = \frac{T_a}{\lambda}$.

If $\lambda_s = e^s$, then $T_b = \ln(T_a + 1)$, i.e., $\frac{T_b}{T_a}$ goes to zero as $T_a$ grows, and $b$ is faster than $a$.
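The time-change identity $b_t = a_{g(t)}$ can be checked numerically. The sketch below uses a hand-written fourth-order Runge-Kutta integrator and a hypothetical vector field $f(x) = 1 - x$ with $\lambda_s = e^s$, so that $g(t) = e^t - 1$; all of these choices are assumptions made for illustration.

```python
import math


def rk4(rhs, x0, t_end, n_steps):
    """Integrate x' = rhs(t, x) from 0 to t_end with classical RK4."""
    h = t_end / n_steps
    t, x = 0.0, x0
    for _ in range(n_steps):
        k1 = rhs(t, x)
        k2 = rhs(t + h / 2, x + h / 2 * k1)
        k3 = rhs(t + h / 2, x + h / 2 * k2)
        k4 = rhs(t + h, x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x


def f(x):
    return 1.0 - x  # hypothetical vector field (assumption), rest point 1


T = 1.0
g_T = math.exp(T) - 1.0  # g(T) = integral of e^s over [0, T]
b_T = rk4(lambda t, x: math.exp(t) * f(x), 0.0, T, 2000)  # accelerated ODE at time T
a_gT = rk4(lambda t, x: f(x), 0.0, g_T, 2000)             # original ODE at time g(T)
print(b_T, a_gT)
```

The accelerated trajectory at time $T$ coincides (up to integration error) with the original trajectory at the later time $g(T)$, which is the content of Proposition 1.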

3 Mean-field learning 

We now consider a continuum of players. The mean field is the action
distribution $m$. Its mean is in the set $\mathcal{A}$ and is denoted by $\bar{m}$. Then the limiting
payoff for a single class writes $r(a, \bar{m})$. All the functions $f_j$ above reduce to a
single function $f(\bar{m})$, i.e., each player responds to the mean of the mean field.
Therefore the Banach-Picard (mean-field response) dynamics becomes

$a_{t+1} = f(a_t, \bar{m}_t),$

which can be reduced to

$a_{t+1} = f(a_t, \bar{m}_t) = \bar{m}_{t+1} = f(\bar{m}_t, \bar{m}_t) =: f(\bar{m}_t) \qquad (9)$

by the indistinguishability property.

Thus, Banach-Picard-based mean-field response learning is given by

$\bar{m}_{t+1} = f(\bar{m}_t). \qquad (10)$
Note that in Equation (10) only the starting point $\bar{m}_0$ is required if the player
knows the structure of $f$. This means that it is not necessary to feed back the mean
of the mean field at each step.

Proposition 2 Consider a mean-field game $r(a, \bar{m})$ such that the mean-field
response function $f$ is a strict contraction mapping over $\mathcal{A}$ (a non-empty,
convex subset of $\mathbb{R}^d$) with constant $\alpha_1$. Then, the mean-field response learning
(10) finds an approximate fixed-point within an $\eta$-neighborhood in at most
$T_\eta = 1 + \lfloor \max(0, T) \rfloor$ iterations, where $T = \frac{\ln\left(\eta(1 - \alpha_1)/d(\bar{m}_0, \bar{m}^*)\right)}{\ln \alpha_1}$.
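The Ishikawa-based mean-field response is

$\bar{m}_{t+1} = \lambda_t f\left(\mu_t f(\bar{m}_t) + (1 - \mu_t)\bar{m}_t\right) + (1 - \lambda_t)\bar{m}_t, \qquad (11)$
$0 \le \lambda_t \le 1, \quad 0 \le \mu_t \le 1, \qquad (12)$
$\bar{m}_0 \in \mathcal{A}. \qquad (13)$

Based on Theorem 2, the next proposition provides an upper bound on the
convergence time of order $O(1/\eta^2)$.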

Proposition 3 Consider a mean-field game $r(a, \bar{m})$ such that the mean-field
response function $f$ is a nonexpansive mapping over $\mathcal{A}$ (a non-empty, convex subset
of $\mathbb{R}^d$). Then, the Ishikawa-based mean-field learning (with $\mu_t = 0$ and
$\lambda_t = \lambda \in (0, 1)$) finds an approximate fixed-point within an $\eta$-neighborhood
in at most $T_\eta = 1 + \lfloor \max(0, T) \rfloor$ iterations, where
$T = \frac{4\, d(\bar{m}_0, \mathrm{fix}(f))^2}{\pi \lambda (1 - \lambda) \eta^2}$.

Proposition 4 provides a convergence time bound of order $O(\ln(1/\eta))$ for a
Lipschitz (with constant $L$) and strongly pseudocontractive mean-field response
function $f$.

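Proposition 4 Let $f$ be Lipschitz with constant $L$ and strongly
pseudocontractive with constant $k$ such that fix($f$) is non-empty. Then the Ishikawa
algorithm with $\lambda_t = \lambda \in (0, \bar{\lambda})$, $\bar{\lambda} = \frac{k}{(L+1)(L+2-k)}$, $\mu_t = 0$ converges strongly to
the (unique) fixed point of $f$ within at most $T_\eta = 1 + \lfloor \max(0, T) \rfloor$
iterations, where $T = \frac{\ln\left(\eta / d(\bar{m}_0, \mathrm{fix}(f))\right)}{\ln \rho(\lambda^*)}$, with $\rho$ as in Theorem 3 and
$\lambda^* = -1 + \sqrt{1 + \bar{\lambda}}$.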

3.1 How to accelerate the convergence rate? 

We have seen in the previous sections that under suitable conditions one can
approximate fixed-points. However, the major concern associated with the above
fixed-point iteration is that the iterates exhibit only a linear convergence rate,
which may be unacceptably slow.

Speedup strategic learning is a method that studies learning mechanisms
for speeding up the convergence based on few experiences. The input to a
speedup technique typically consists of observations of a previously realized
sequence of experiences, which may include traces of the measurements of the real
problem. The output is knowledge that the technique can exploit to find solutions
more quickly than before and without seriously affecting solution quality.
Our motivation for speedup learning is two-fold: 

• Traditional results on the fundamental limits of data compression and
data transmission through noisy channels apply to the asymptotic regime
as the codelength or blocklength goes to infinity. However, these asymptotic
results are not usable if the window size and horizon are bounded. Speedup
strategic learning is aimed exclusively at finding solutions in a more
practical time frame. Therefore, it is interesting to look at the non-asymptotic
regime using speedup learning algorithms.

• Speedup strategic learning aims to create adaptive schemes that can learn
patterns from a small number of experiences that can be exploited for
efficiency gains. Such adaptive schemes have the potential to significantly
outperform traditional learning schemes by specializing their behavior to 
the characteristics of the fixed-point problem. 

Consider a convergent mean-field learning scheme that generates a sequence
$\{\bar m_t\}_t$. Suppose now that only a few terms of the sequence are
available to the users. Each user aims to learn an approximate fixed point
based only on a minimal amount of information, namely the estimates
$\bar m_0, \bar m_1, \ldots, \bar m_{T-1}$. The goal of a generic user is to
accelerate the previous learning algorithm by transforming the slowly
converging sequence into a new one that converges to the same limit
$\bar m^*$ as the first one, but faster. If possible, we aim to be as close as
possible to $\bar m^*$ based only on the $T$ observations of the sequence
$\{\bar m_t\}_t$.

Definition 2 Assume that $\bar m_t$ converges to $\bar m^*$ and let
$\eta_t = |\bar m_t - \bar m^*|$. If two positive constants $c_1, o > 0$ exist
such that

\[ \limsup_{t} \frac{\eta_{t+1}}{\eta_t^{o}} = c_1, \]

then the sequence $\{\bar m_t\}_t$ is said to converge to $\bar m^*$ with order
of convergence $o$. The number $c_1$ is called the asymptotic error constant.
The cases $o \in \{1, 2, 3\}$ are given special consideration.

(i) If $o = 1$ the convergence of $\{\bar m_t\}_t$ is called linear.

(ii) If $o = 2$ the convergence of $\{\bar m_t\}_t$ is called quadratic.

(iii) If $o = 3$ the convergence of $\{\bar m_t\}_t$ is called cubic.
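The order in Definition 2 can be estimated from a few observed terms. The sketch below is only an illustration: it assumes the limit is known and uses the standard three-point estimate $o \approx \ln(\eta_{t+1}/\eta_t) / \ln(\eta_t/\eta_{t-1})$. The helper name `estimate_order` and the two test sequences (a fixed-point iteration of $\sqrt{3 + 2m}$ and a Newton iteration on $m^2 - 2m - 3$, both with limit 3) are choices made here for illustration.

```python
import math


def estimate_order(seq, limit):
    """Estimate the order of convergence from consecutive errors.

    Uses o ~ ln(eta_{t+1} / eta_t) / ln(eta_t / eta_{t-1}).
    """
    errs = [abs(x - limit) for x in seq]
    return [
        math.log(errs[t + 1] / errs[t]) / math.log(errs[t] / errs[t - 1])
        for t in range(1, len(errs) - 1)
    ]


# Linearly convergent sequence: m_{t+1} = sqrt(3 + 2 m_t), limit 3.
linear = [4.0]
for _ in range(8):
    linear.append(math.sqrt(3.0 + 2.0 * linear[-1]))

# Quadratically convergent sequence: Newton on g(m) = m^2 - 2m - 3, root 3.
newton = [4.0]
for _ in range(4):
    m = newton[-1]
    newton.append(m - (m * m - 2.0 * m - 3.0) / (2.0 * m - 2.0))

linear_orders = estimate_order(linear, 3.0)
newton_orders = estimate_order(newton, 3.0)
print(linear_orders[-1], newton_orders[-1])
```

The last estimates should be close to 1 and 2 respectively. The Newton sequence is kept short on purpose, since its error reaches machine precision after a few more steps and the logarithms then become meaningless.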



3.1.1 Partially distributed speedup methods 

We present speedup learning in one-dimensional space. However, most of the
speedup schemes below extend to multi-dimensional action spaces.

Quadratic order speedup techniques 

The basic Newton method iterates

\[ \bar m_{t+1} = \bar m_t - \frac{g(\bar m_t)}{g'(\bar m_t)}. \qquad (14) \]

If $g(\bar m^*) = 0$ and $g'(\bar m^*) \neq 0$, then Newton's method converges
quadratically in a neighborhood of $\bar m^*$:
$\eta_{t+1} = |\bar m_{t+1} - \bar m^*| \le \frac{|h''(a_t)|}{2}\, \eta_t^2$, where
$a_t$ is in a neighborhood of $\bar m^*$ and $h(y) = y - \frac{g(y)}{g'(y)}$.
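As a minimal sketch of the Newton speedup, the code below applies the update $m \leftarrow m - g(m)/g'(m)$ to the function $g(m) = m^2 - 2m - 3$, whose root 3 is the fixed point of $\sqrt{3 + 2m}$. The function name `newton` and the fixed number of steps are illustrative choices, and the derivative is assumed to be known to the player.

```python
def newton(g, dg, m0, steps):
    """Newton iteration m <- m - g(m) / g'(m); returns all iterates."""
    seq = [m0]
    for _ in range(steps):
        m = seq[-1]
        slope = dg(m)
        if slope == 0.0:
            break  # the update is undefined when the derivative vanishes
        seq.append(m - g(m) / slope)
    return seq


newton_seq = newton(
    lambda m: m * m - 2.0 * m - 3.0,  # g
    lambda m: 2.0 * m - 2.0,          # g'
    4.0,
    6,
)
print(newton_seq)
```

For this particular $g$ the error obeys $e_{t+1} = e_t^2 / (4 + 2 e_t)$, so the number of correct digits roughly doubles at each step.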

For a root of multiplicity $q > 1$, the scheme can be modified to be

\[ \bar m_{t+1} = \bar m_t - q\, \frac{g(\bar m_t)}{g'(\bar m_t)}, \]

or one can apply the Newton method to the function $g/g'$.

3.1.2 Cubic order speedup techniques 

One of the most studied cubic techniques is Halley's method, which updates as

\[ \bar m_{t+1} = \bar m_t - \frac{2\, g(\bar m_t)\, g'(\bar m_t)}{2 [g'(\bar m_t)]^2 - g(\bar m_t)\, g''(\bar m_t)}. \]
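Halley's update can be sketched in the same way as Newton's. The code below uses the standard form $m \leftarrow m - 2 g g' / (2 g'^2 - g g'')$ on the illustrative function $g(m) = m^2 - 2m - 3$ (root 3). The name `halley` is a choice made here, and both derivatives are assumed to be available in closed form.

```python
def halley(g, dg, d2g, m0, steps):
    """Halley iteration m <- m - 2 g g' / (2 g'^2 - g g''); returns all iterates."""
    seq = [m0]
    for _ in range(steps):
        m = seq[-1]
        gm, d1, d2 = g(m), dg(m), d2g(m)
        denom = 2.0 * d1 * d1 - gm * d2
        if denom == 0.0:
            break  # the update is undefined for a vanishing denominator
        seq.append(m - 2.0 * gm * d1 / denom)
    return seq


halley_seq = halley(
    lambda m: m * m - 2.0 * m - 3.0,  # g
    lambda m: 2.0 * m - 2.0,          # g'
    lambda m: 2.0,                    # g''
    4.0,
    4,
)
print(halley_seq)
```

Starting from 4, the first step already lands at $4 - 60/62 \approx 3.032$, and the error then shrinks cubically.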



3.1.3 Arbitrary order speedup techniques 

We start with Householder's speedup method [27]. If the map $f$ is known by the
player, then the classical fixed-point iteration is
$\bar m_{t+1} = f(\bar m_t)$; we set $g(\bar m) = f(\bar m) - \bar m$. An
$(o+1)$-order speedup learning is given by

\[ \bar m_{t+1} = \bar m_t + o\, \frac{(1/g)^{(o-1)}(\bar m_t)}{(1/g)^{(o)}(\bar m_t)}, \]

where $o$ is an integer and $(1/g)^{(o)}$ is the derivative of order $o$ of the
reciprocal $1/g$ of the function $g$.

It is well known that if $f$ is an $(o+1)$ times continuously differentiable
function and $\bar m^*$ is a fixed point of $f$ but not of its derivative, then,
in a neighborhood of $\bar m^*$, the iterates $\bar m_t$ satisfy

\[ |\bar m_{t+1} - \bar m^*| \le c_2\, |\bar m_t - \bar m^*|^{o+1}, \]

for some constant $c_2$ obtained by bounding the derivatives of the function
$g$ at $\bar m^*$. The bound is finite by continuity over a compact set. This
means that the iterates converge to the fixed point if the initial guess is
sufficiently close, and that the convergence has order $(o+1)$. Thus, if $f$ is
an infinitely differentiable function, this scheme yields a very fast, locally
convergent speedup learning algorithm of arbitrarily high order. In particular,
for $o = 1$ this is Newton's method, and for $o = 2$ it is Halley's method.
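The two special cases can be checked directly from the Householder update $\bar m_{t+1} = \bar m_t + o\,(1/g)^{(o-1)}(\bar m_t) / (1/g)^{(o)}(\bar m_t)$. The short derivation below only uses the first two derivatives of the reciprocal $1/g$ and is included as a worked verification.

```latex
% Derivatives of the reciprocal 1/g
(1/g)' = -\frac{g'}{g^{2}}, \qquad
(1/g)'' = \frac{2\,(g')^{2} - g\,g''}{g^{3}}.

% o = 1 recovers Newton's method
\bar m_{t+1}
  = \bar m_t + \frac{1/g}{(1/g)'}
  = \bar m_t - \frac{g(\bar m_t)}{g'(\bar m_t)}.

% o = 2 recovers Halley's method
\bar m_{t+1}
  = \bar m_t + 2\,\frac{(1/g)'}{(1/g)''}
  = \bar m_t - \frac{2\,g(\bar m_t)\,g'(\bar m_t)}
                    {2\,[g'(\bar m_t)]^{2} - g(\bar m_t)\,g''(\bar m_t)}.
```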

For the case where the function $f$ is smooth with a unique fixed point, we
can systematically generate high-order, quickly converging mean-field learning
methods of any desired degree for the solution of the fixed-point problem. Fast
converging mean-field learning methods like these should be of great use in
large-scale algorithms that require the repeated solution of a nonlinear
equation over long time periods and where an efficient solution algorithm is
imperative to avoid excessive computation time. This reduces the so-called
cost of impatience, i.e., the cost due to the error gap to the solution. We now
present our main result on the convergence time of the $(o+1)$-order speedup
scheme.

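Theorem 4 Let $c_2 < 1$, $\eta_0 = |\bar m_0 - \bar m^*| < 1$ and $o > 0$. Then
the scheme is convergent and the error at iteration $t \ge 1$, $\eta_t$, is
bounded by

\[ \eta_t \le c_2\, \eta_0^{(o+1)^t}. \]

Thus, the convergence time to an $\eta^*$-range of the pure mean-field
equilibrium is at most

\[ 1 + \lfloor \max(0, T) \rfloor, \quad \text{where} \quad
T = \frac{1}{\ln(o+1)} \ln\left[ \frac{\ln(\eta^*/c_2)}{\ln \eta_0} \right]. \]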
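The convergence time bound can be checked numerically. The sketch below makes explicit assumptions: it takes the error bound in the form $\eta_t \le c_2\, \eta_0^{(o+1)^t}$ for $t \ge 1$, from which $T = \ln[\ln(\eta^*/c_2)/\ln \eta_0] / \ln(o+1)$ follows by taking logarithms twice, and it compares this bound with the worst-case recursion $\eta_{t+1} = c_2\, \eta_t^{o+1}$. The function names and the numerical values are illustrative only.

```python
import math


def time_bound(c2, eta0, o, eta_star):
    """Return 1 + floor(max(0, T)) with T = ln(ln(eta*/c2) / ln(eta0)) / ln(o + 1)."""
    if eta_star >= c2:
        # One step suffices, since eta_1 <= c2 * eta0**(o + 1) < c2 <= eta*.
        return 1
    T = math.log(math.log(eta_star / c2) / math.log(eta0)) / math.log(o + 1)
    return 1 + math.floor(max(0.0, T))


def hitting_time(c2, eta0, o, eta_star):
    """First t >= 1 with eta_t <= eta* under the worst case eta_{t+1} = c2 * eta_t**(o + 1)."""
    eta, t = eta0, 0
    while True:
        eta, t = c2 * eta ** (o + 1), t + 1
        if eta <= eta_star:
            return t


for order in (1, 2):
    print(order, hitting_time(0.9, 0.5, order, 1e-4), time_bound(0.9, 0.5, order, 1e-4))
```

With $c_2 = 0.9$, $\eta_0 = 0.5$ and $\eta^* = 10^{-4}$, the worst-case recursion needs 4 steps for $o = 1$ and 3 steps for $o = 2$, which the bound matches in both cases.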



3.1.4 Derivative-free speedup methods 

The best-known sequence transformations are Aitken's $\Delta^2$ process
(1926, [18]), Richardson's extrapolation algorithm (1927, [23]), the Shanks
transformation (1955, [22]), the Romberg transformation (1955, [19]) and Wynn's
$\epsilon$-algorithm (1956, [25], [26]). These speedup techniques are not based
on derivatives, which is why they are used more frequently in practical
computations.

Next we present a superlinear, derivative-free speedup technique called the
Secant method. It is inspired by Newton's method, with the difference quotient
$\frac{g(a) - g(a')}{a - a'}$ replacing the derivative.



Secant speedup method 

In the Secant speedup method (Algorithm 4), we define the sequence
$\bar m_2, \bar m_3, \bar m_4, \ldots$ using two initial guesses $\bar m_0$ and
$\bar m_1$ and the formula

\[ \bar m_{t+1} = \bar m_t - \frac{g(\bar m_t)\,(\bar m_t - \bar m_{t-1})}{g(\bar m_t) - g(\bar m_{t-1})}, \qquad (15) \]

which can be obtained by replacing $g'(\bar m_t)$ by
$\frac{g(\bar m_t) - g(\bar m_{t-1})}{\bar m_t - \bar m_{t-1}}$ in Equation (14).
Note that the Secant method can be written as
$\bar m_{t+1} = F(\bar m_t, \bar m_{t-1})$, which is a two-step memory scheme,
since two previous values are required to calculate the next value in the
sequence. The Secant speedup method converges with order approximately 1.618
(the golden ratio), i.e., more quickly than a method with linear convergence,
but more slowly than a method with quadratic convergence.

Algorithm 4 : Secant speedup method

Initialization :
    Make a starting guess $\bar m_0$, $\bar m_1$
Speedup learning pattern :
    for each time slot t
        Observe $\bar m_t$ and compute $g(\bar m_t)$
        Compute $\bar m_{t+1} = \bar m_t - \frac{g(\bar m_t)(\bar m_t - \bar m_{t-1})}{g(\bar m_t) - g(\bar m_{t-1})}$
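The Secant speedup method can be sketched in a few lines. The example below assumes the illustrative function $g(m) = m^2 - 2m - 3$ (root 3) and the two starting guesses 4 and $\sqrt{11} \approx 3.3166$, i.e., the first two terms of the fixed-point iteration $m \mapsto \sqrt{3 + 2m}$. The function name `secant` is a choice made here.

```python
import math


def secant(g, m0, m1, steps):
    """Secant iteration: Newton's update with g' replaced by a difference quotient."""
    seq = [m0, m1]
    for _ in range(steps):
        prev, cur = seq[-2], seq[-1]
        denom = g(cur) - g(prev)
        if denom == 0.0:
            break  # the difference quotient is undefined
        seq.append(cur - g(cur) * (cur - prev) / denom)
    return seq


secant_seq = secant(lambda m: m * m - 2.0 * m - 3.0, 4.0, math.sqrt(11.0), 3)
print([round(m, 4) for m in secant_seq])
```

Only one new evaluation of $g$ is needed per step if the previous value is cached; the sketch recomputes it for brevity.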



Aitken's speedup method 

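The new sequence that accelerates the convergence via Aitken extrapolation is

\[ y_t = \bar m_t - \frac{(\bar m_{t+1} - \bar m_t)^2}{\bar m_{t+2} - 2 \bar m_{t+1} + \bar m_t}, \]

which is obtained by solving the equation
$\frac{\bar m_{t+1} - y}{\bar m_t - y} = \frac{\bar m_{t+2} - y}{\bar m_{t+1} - y}$
for $y$.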
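A minimal sketch of Aitken extrapolation follows. It assumes the illustrative fixed-point sequence $m_{t+1} = \sqrt{3 + 2 m_t}$ started at 4 (limit 3); the helper name `aitken` is a choice made here. Each accelerated term uses three consecutive terms of the original sequence.

```python
import math


def aitken(seq):
    """Aitken extrapolation: y_t = m_t - (m_{t+1} - m_t)^2 / (m_{t+2} - 2 m_{t+1} + m_t)."""
    out = []
    for t in range(len(seq) - 2):
        denom = seq[t + 2] - 2.0 * seq[t + 1] + seq[t]
        if denom == 0.0:
            break  # the sequence has numerically converged
        out.append(seq[t] - (seq[t + 1] - seq[t]) ** 2 / denom)
    return out


original = [4.0]
for _ in range(4):
    original.append(math.sqrt(3.0 + 2.0 * original[-1]))

accelerated = aitken(original)
print(accelerated)
```

Five observed terms give three accelerated terms, each markedly closer to 3 than the original terms they are built from.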

Steffensen's speedup method 

Steffensen's speedup method (Algorithm 5) is a variant of the Aitken method
that uses the Aitken formula to generate a better sequence directly:

Algorithm 5 : Steffensen's speedup method
1: Initialization :
    Make a starting guess $\bar m_0$
2: Speedup learning pattern :
    for each time slot t
        Compute $\bar m_1 = f(\bar m_0)$, $\bar m_2 = f(\bar m_1)$
        Use Aitken's speedup method to compute $y_0$
        Restart with $y_0$
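The restart idea can be sketched as follows: from the current point, take two fixed-point steps, apply the Aitken formula to the three values, and restart from the result. The map $f(m) = \sqrt{3 + 2m}$ with starting point 4 is used as an illustration, and the function name `steffensen` is a choice made here.

```python
import math


def steffensen(f, m0, steps):
    """Steffensen iteration: Aitken extrapolation on (a, f(a), f(f(a))), then restart."""
    seq = [m0]
    for _ in range(steps):
        a = seq[-1]
        b = f(a)
        c = f(b)
        denom = c - 2.0 * b + a
        if denom == 0.0:
            break  # already converged to machine precision
        seq.append(a - (b - a) ** 2 / denom)
    return seq


steffensen_seq = steffensen(lambda m: math.sqrt(3.0 + 2.0 * m), 4.0, 3)
print(steffensen_seq)
```

Each step costs two evaluations of $f$, and no derivative is required.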



Remark 1 One sometimes attempts to compare the Newton, Secant, Aitken
and Steffensen speedup methods. So, which method is faster? Ignoring
constants, it would seem obvious that Newton's method (a model-based speedup
technique) is faster than the Secant method (a non-model speedup technique),
since it converges with a higher order. However, to compare performance, we
must consider both computational cost and speed of convergence. An algorithm
that converges quickly but takes a few seconds per iteration may take far more
time overall than an algorithm that converges more slowly but takes only a few
milliseconds per iteration, so comparing orders alone is not fair. For the
purpose of this general analysis, we may assume that the computational cost of
an iteration is dominated by the evaluation of the function, so the number of
function evaluations per iteration is likely a good measure of cost. The
secant method requires only one function evaluation per iteration (the
function $g$), since the value of $g(\bar m_{t-1})$ can be stored from the
previous iteration. Newton's method requires one function evaluation and one
evaluation of the derivative per iteration. It is difficult to estimate the
cost of evaluating the derivative in general. In some cases, the derivative
may be easy to evaluate; in others, it may be much harder to evaluate than the
function (if it is possible at all). If we can run two iterations of the
secant speedup method in the time it takes to run one iteration of Newton's
method, then two iterations of the Secant speedup method should be compared to
one iteration of Newton's speedup method. But two iterations of the Secant
method have a combined order of $o^2 \approx 2.618 > 2$, and are hence faster
than Newton's method.

The Aitken speedup method requires three consecutive terms of the sequence
$\bar m_t$ to produce a quadratically convergent speedup technique.

3.2 Fully distributed derivative-free mean-field learning

We now present a fully distributed mean-field learning scheme based on the
work of [24]. Each generic player adjusts its action based on numerical
measurements of its own payoff (with some i.i.d. noise). The first-order
learning scheme with a large number of players is given in Algorithm 6.



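Algorithm 6 : Mean-field learning with first order sinusoidal perturbation
1: Initialization :
    for each user $j \in \mathcal N$ do
2: Learning pattern :
    for each time slot t
        for each user $j \in \mathcal N$ do
            Observe a realized payoff $r_{j,t}$
            $\hat a_{j,t+1} = \hat a_{j,t} + \lambda_{j,t} k_j r_{j,t} \epsilon_j \sin(w_j \hat t_j + \phi_j)$
            $a_{j,t} = \hat a_{j,t} + \epsilon_j \sin(w_j \hat t_j + \phi_j)$
            $\hat t_j = \sum_{t'=1}^{t} \lambda_{j,t'}$
where $\lambda_{j,t}, k_j, \epsilon_j > 0$, $\phi_j \in \mathbb R$.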
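To make the first-order sinusoidal scheme concrete, the sketch below simulates a single user. Several things are assumptions made for illustration and are not taken from the text: the noise-free quadratic payoff $r(a) = -(a - 1)^2$ with maximizer 1, the constant step size, all parameter values, and the reading of the update as $\hat a_{t+1} = \hat a_t + \lambda k\, r_t\, \epsilon \sin(w \hat t + \phi)$ with played action $a_t = \hat a_t + \epsilon \sin(w \hat t + \phi)$.

```python
import math


def sinusoidal_learning(payoff, a_hat0, lam, k, eps, w, phi, steps):
    """Payoff-based learning: the user only observes the realized payoff value."""
    a_hat, t_hat = a_hat0, 0.0
    for _ in range(steps):
        t_hat += lam                      # hat t = cumulative sum of step sizes
        s = math.sin(w * t_hat + phi)
        a = a_hat + eps * s               # perturbed action actually played
        r = payoff(a)                     # realized payoff (noise-free here)
        a_hat += lam * k * r * eps * s    # first-order sinusoidal update
    return a_hat


# Hypothetical payoff with maximizer a* = 1; the user never sees its formula.
estimate = sinusoidal_learning(
    payoff=lambda a: -(a - 1.0) ** 2,
    a_hat0=2.0, lam=0.01, k=5.0, eps=0.5, w=50.0, phi=0.0, steps=5000,
)
print(estimate)
```

On average the update moves $\hat a$ along $k \epsilon^2 r'(\hat a)/2$, so the estimate drifts towards the maximizer and then oscillates in a small band around it; the band shrinks with the perturbation amplitude.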



The mean of the mean-field generated by Algorithm [6] is given by 

rhj^tJri = iTij.t + liminf — ^j,tkjrj^t£j sm{wjij -f 0^), 

i 

mj,t = rhj,t + liminf — ej sm{wjij -\- (jij). 

i 

Let $\eta_{j,t} := \hat a_{j,t} - a_j^*$. From Taylor expansions,
$\partial_{a_j} r_j(a_t) = \eta_t\, \partial^2_{a_j a_j} r_j(a^*) + O(|\eta_t|^2)$
and $\partial^2_{a_j a_j} r_j(a_t) = \partial^2_{a_j a_j} r_j(a^*) + O(|\eta_t|)$.
In the first-order sinusoidal learning, the error rate is proportional to the
second derivative (the Hessian of the payoff). Since the payoff function is
unknown to the players, it is difficult to tune the convergence rate with
appropriate parameters. Therefore, we estimate the Hessian and construct
dynamics that converge asymptotically to the pseudo-inverse of the Hessian.
The local behaviour of the second-order sinusoidal learning is then
independent of the Hessian, which is of great importance in non-model learning
where the Hessian is unknown. The second-order sinusoidal perturbation
mean-field learning is given in Algorithm 7.



Algorithm 7 : Mean-field learning with second order sinusoidal perturbation 
1: Initialization : 

for each user j E J\f do 

2: Learning pattern : 

for each time slot t 
for each user j & Af do 
Observe a realized payoff rj^t 
aj,t+i = aj,t + Xj,tkjdfjrj^tj-sm{wjij + 
aj^t = aj,t + sm{wjtj + 4>j) 
dgVi - (1 + ><,.tw,)df} + A,- 1 

where \j^t,kj,Wc,tj > 0, (pj G M. 

(2) 

The product of s)j^rj^t needs to generate an estimate of the Hessian in a 

"(2) 

time-average sense and dj l^-^ should generate an estimate pseudo-inverse of the 
Hessian. Example of function sfj is -^{sin^^Wjij + <f)j) — 1). 

3.3 Feedback-free mean-field learning 

Below we develop a learning algorithm without feedback (no mean-field
feedback, and the other players' actions at the previous step are not
observed, [TB]) but with knowledge of the mathematical structure of the payoff
function. Feedback-free mean-field learning is important and is empirically
testable. In it, players think that others are likely to take the same
function of actions as themselves, resulting in a false consensus or non-false
consensus depending on one's view of the irrationality of the behavior and
incompatibility of beliefs and conjectures.

A simple example of feedback-free learning consists of taking a speedup version
of $\bar m_{t+1} = f(\bar m_t)$, starting from some estimate of the initial
point, and iterating offline.

Mean-field global optimization 

Consider the average payoff in the form
$\frac{1}{n} \sum_{j \in \mathcal N} r(a_j, m_{j,n,t})$, which is, in general,
not concave in the joint action. The asymptotic regime is
$\int r(a, m)\, m(da)$, which can be analyzed using measure theory. In
particular, if the integral can be written as a function of only the first
moment $\bar m$ of the mean-field $m$, i.e., $f(\bar m)$, then the above
mean-field learning schemes can be used to learn the mean-field social optimum.

Example 3 We consider the payoff function in the following form:
$r_j(a) = a_j\, h\left(\frac{\sum_{j'} a_{j'}}{n}\right) - p\, a_j$. Then the total
payoff of all the players is $\sum_{j=1}^{n} r_j(a) = D\, h(D/n) - pD$, where $D$
is the total sum of actions. Dividing by $n$, one gets

\[ \frac{1}{n} \sum_{j=1}^{n} r_j(a) = \frac{D}{n}\, h(D/n) - p\, \frac{D}{n} = \bar m_n\, h(\bar m_n) - p\, \bar m_n. \]

The dimensionality of the global optimization problem is thus reduced to one:
it suffices to optimize the function $z \mapsto z\, h(z) - pz$.
The local extrema of this scalar function can be found using the above techniques.
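As a worked instance of this reduction, suppose (purely for illustration; this choice of $h$ is hypothetical and not taken from the text) that $h(z) = 1 - z$ and $0 < p < 1$. The scalar problem then has a closed-form maximizer:

```latex
\varphi(z) = z\,h(z) - p\,z = (1 - p)\,z - z^{2},
\qquad
\varphi'(z) = 1 - p - 2z = 0
\;\Longrightarrow\;
z^{*} = \frac{1 - p}{2},
\qquad
\varphi''(z) = -2 < 0,
\qquad
\varphi(z^{*}) = \frac{(1 - p)^{2}}{4}.
```

For a less tractable $h$, the stationarity condition $\varphi'(z) = 0$ can be solved with any of the speedup iterations above applied to $g = \varphi'$.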

This means that a mean-field optimization can be conducted easily for any
limiting function of the form
$a\, h(\bar m) + \beta_1 a + \beta_2\, g(\bar m) + \beta_3$, where the $\beta_j$'s
are real numbers and $h$, $g$ are limiting functions of the mean of the
mean-field.

In the next section, we apply mean-field learning and speedup techniques to the
beauty contest game, or guessing game.

4 Beauty contest game 

We revisit the beauty contest game in the context of mean-field. The name
of this game and its basic idea go back to John Maynard Keynes (1936), who
compared a clever investor to a participant in a newspaper beauty contest where
the aim was to guess the average preferred face among 100 photographs. The
initial beauty contest game was analyzed for integers, although in 1993 the
German economist Rosemarie Nagel based her experiments on a variant
of the game played by Keynes' newspaper readers: each player chooses a real
number between 0 and 100 inclusive. The number need not be an integer. A
player wins if its number is closest to 2/3 of the average of the numbers given
by all participants.

There are two ways to see that a unique equilibrium solution exists. First,
one can easily see that no one should submit a number higher than 66, because
whatever the others do, a guess higher than 66 cannot be better than 66.
However, if no one guesses more than 66, then all numbers between 44 (that is,
two thirds of 66) and 66 are inferior to 44. Hence no one would guess more than
44, and so on, until 0 is the only remaining reasonable choice. (Note that a
second solution besides 0 is 1 if only integers are allowed.)

A second method is simply to try it out: presume you guess a certain
number $a$ and everyone else guesses the same number $a$; would you still wish
to stick to your initial guess $a$? If so, you have found a symmetric Nash
equilibrium. The only $a$ to which you would want to stick is 0.

More generally, in a beauty contest game, $n > 3$ players simultaneously
choose numbers $a_j$ in some interval, say $[0, M]$, $M > 0$. The average of
their numbers $m_n = \frac{1}{n} \sum_j a_j$ is computed, which establishes a
target number $p\, m_n$, where $0 < p < 1$ is a parameter. The player whose
number is closest to the target $p\, m_n$ wins a prize $a_n$. The generic payoff
is



rj (a) = 



-^{aj Garg min^/ \a'—pnin\} 
j ^ 

-^{aj (^arg min^/ |a^— pm^il} 



This model of beauty contest games were studied experimentally by Nagel 
(1995, [m). 

These games are useful for estimating the number of steps of iterated
dominance that players use in reasoning through games. To illustrate, suppose
$p < 1$. Since the target can never be above $pM$, any number choice above $pM$
is stochastically dominated by simply picking $pM$. Similarly, players who obey
dominance, and believe others do too, will pick numbers below $p^2 M$, so
choices in the interval $(p^2 M, M]$ violate the conjunction of dominance and
one step of iterated dominance. Iterating this argument gives $p^k M \to 0$ as
$k \to \infty$, hence the unique Nash equilibrium is 0. If $p > 1$, the
equilibrium is $M$. If $p = 1$, every feasible symmetric action profile is an
equilibrium.
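The shrinking bounds of iterated dominance are easy to tabulate. The sketch below assumes $M = 100$ and $p = 2/3$ as in the classical experiment; the helper names and the tolerance are illustrative. After $k$ rounds of reasoning, no surviving choice exceeds $p^k M$.

```python
def dominance_bounds(M, p, steps):
    """Upper bound on surviving choices after k = 0..steps rounds of iterated dominance."""
    return [M * p ** k for k in range(steps + 1)]


def steps_until_below(M, p, tol):
    """Number of rounds needed before the upper bound p^k * M drops below tol."""
    k, bound = 0, M
    while bound >= tol:
        bound *= p
        k += 1
    return k


bounds = dominance_bounds(100.0, 2.0 / 3.0, 3)
rounds = steps_until_below(100.0, 2.0 / 3.0, 1.0)
print(bounds, rounds)
```

With these values the bound falls below 1 only after 12 rounds, which gives a sense of how many levels of reasoning full convergence to 0 presupposes.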

We now consider a small modification of the guessing game in order to get
interior equilibria. The target is changed to be $\mu + p\, m_n$, $\mu > 0$. The
generic payoff is
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where (i? > 0,k > 0, ^> 0). 

In the asymptotic regime, the interior mean-field equilibrium (if any) should
satisfy $\bar m = \mu + p\, \bar m$, $p < 1$, i.e., $\bar m^* = \frac{\mu}{1 - p}$
in the interval $[0, M]$.

Assuming that the functions are known, we can use feedback-free mean-field
learning. One possible explanation of no-feedback learning in the
beauty contest game (or guessing game) is that players simply take an action,
treat their action as representative of the choices of other players, and then
best respond to this belief about the mean of the mean-field. This kind of
reasoning would predict convergence towards equilibrium in the guessing game.

The best response dynamics is given by $\min(M, \mu + p\, \bar m_{t-1})$, where
the starting point is $\bar m_0 > 0$. Each trader starts with an estimate
$\bar m_0$. It is clear that for $\mu = 0$, $p < 1$, if each trader estimates the
initial point and iterates the mean-field learning process offline, the process
converges to the mean-field equilibrium. For $\mu > 0$, the interior response
reads $\bar m_{t+1} = \mu + p\, \bar m_t$.
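The offline best-response iteration can be sketched directly. The example assumes the update $m \leftarrow \min(M, \mu + p\, m)$ and illustrative values $M = 100$, $p = 2/3$, with $\mu = 10$ (interior equilibrium $\mu/(1 - p) = 30$) and $\mu = 0$ (equilibrium 0); the function name is a choice made here.

```python
def best_response_dynamics(mu, p, M, m0, steps):
    """Iterate m <- min(M, mu + p * m) offline, without any feedback."""
    seq = [m0]
    for _ in range(steps):
        seq.append(min(M, mu + p * seq[-1]))
    return seq


interior = best_response_dynamics(mu=10.0, p=2.0 / 3.0, M=100.0, m0=50.0, steps=100)
classic = best_response_dynamics(mu=0.0, p=2.0 / 3.0, M=100.0, m0=50.0, steps=100)
print(interior[-1], classic[-1])
```

The error contracts by the factor $p$ at every step, so this plain iteration converges linearly and is a natural candidate for the speedup techniques of Section 3.1.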

More generally, one can consider a mean-field payoff of the following form:

\[ r_j(a_j, \bar m) = R - K\, \| a_j - x(\bar m) \|, \]

where $x$ is a map which has a fixed point in $[0, M]$. The best response to the
mean $\bar m$ is $x(\bar m)$, and the mean-field pure equilibrium satisfies
$a^* = \bar m^* = x(\bar m^*)$. Hence, learning the mean-field equilibrium
reduces to learning a fixed point of the mapping $x$. For
$x(\bar m) = \sqrt{2 \bar m + 3}$ the iterative process becomes
$\bar m_{t+1} = \sqrt{3 + 2 \bar m_t}$. Note that 3 is a fixed point. Let us fix
the starting point at $\bar m_0 = 4$ and $M = 100$.

We start with $g(\bar m) = \bar m^2 - 2 \bar m - 3$ and use the secant speedup
method. The result of the acceleration technique (15) is presented in Table 1.
We observe that at the fourth iteration, the secant speedup method already has
a precision of $10^{-4}$, while the original sequence has a precision about 100
times coarser. Next we start with the initial point 5 for the secant method and
the initial point 4 for the fixed-point method. We observe that the secant
method has better precision than the fixed-point method after only 3
iterations. This suggests that the secant speedup method is robust to initial
estimation errors.

         Original sequence    Secant speedup method
m_0      4                    4            5
m_1      3.316624790          3.3166       3.6056
m_2      3.103747667          3.0595       3.1833
m_3      3.034385495          3.0043       3.0232
m_4      3.011440019          3.0001       3.0010

Table 1: Fixed-point and Secant speedup method



We summarize the acceleration technique in Table 2, based only on a few steps
of the original sequence.

Original sequence        Aitken          Steffensen
m_0 = 4                  3.007431293     3.000000510
m_1 = 3.316624790        3.000862083     3.000000000000002
m_2 = 3.103747667        3.000097228
m_3 = 3.034385495
m_4 = 3.011440019

Table 2: Acceleration of mean-field learning



Assume that only 5 measurements of the mean sequence are given to the
player: $\bar m_0 = 4$, $\bar m_1 = 3.316624790$, $\bar m_2 = 3.103747667$,
$\bar m_3 = 3.034385495$, $\bar m_4 = 3.011440019$. We apply the acceleration
technique to these measurements and obtain

\[ y_0 = \bar m_0 - \frac{(\bar m_1 - \bar m_0)^2}{\bar m_2 - 2 \bar m_1 + \bar m_0} = 3.007431293, \]

$y_1 = 3.000862083$ and $y_2 = 3.000097228$.

Clearly, the Aitken sequence $\{y_t\}_t$ converges faster to 3, and its error is
smaller than that of the original sequence $\{\bar m_t\}_t$. Next, we restart
the fixed-point iteration from $y_0$ and apply the Aitken formula to $y_0$ and
its two fixed-point iterates $f(y_0)$ and $f(f(y_0))$, with
$f(\bar m) = \sqrt{3 + 2 \bar m}$:

\[ z_0 = y_0 - \frac{(f(y_0) - y_0)^2}{f(f(y_0)) - 2 f(y_0) + y_0} \approx 3.000000510. \]

Repeating the acceleration procedure and taking the fixed-point iteration, one
gets $z_1 = 3.000000000000002$. The sequence $\{z_t\}$ appears to converge to 3
with at least a quadratic convergence rate, which is a great acceleration over
the linear convergence rate of the original sequence. With only two iterations
of Steffensen's speedup method one reaches an error of order $10^{-15}$, which
is satisfactory. In Figure 3 we illustrate the bound of Theorem 4, which gives
the convergence time for the error $\eta^* = 10^{-4}$. We observe that the
results of Figure 3 are similar to those obtained in Table 2.

Figure 3: Illustration of the convergence time bound from Theorem 4. 5
iterations already provide a remarkable result.

5 Speedup strategic learning for satisfactory solution

One of the fundamental challenges in distributed interactive systems is to design
efficient and fair solutions. In such systems, a satisfactory solution is an
innovative approach that aims to provide all players with a satisfactory payoff
anytime, anywhere. Our motivations for satisfactory solution seeking are the
following. In dynamic interactive systems, most users constantly make decisions
which are simply "good enough" rather than best responses or optimal. Simon
(1956, [33]) adopted the word "satisficing" for this type of decision. Most of
the literature on strategic learning and decision-making problems, however,
seeks only the optimal solution or Nash equilibria based on rigid criteria and
rejects others. As mentioned by Simon himself on page 129 of his paper,
"Evidently, organisms adapt well enough to 'satisfice'; they do not, in
general, 'optimize'." Therefore, the satisfactory solution offers an
alternative approach and closely models the way
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humans and cognitive users make decisions [371 ISHl [3S] ■ 

Here, a satisfactory strategy is a decision-making strategy that attempts
to meet an acceptability threshold. This is contrasted with optimal
decision-making or a best response strategy, an approach that specifically
attempts to find the best option available given the choices of the other
users. Following that idea, we define a satisfactory solution as a situation
where every user is satisfied, i.e., her payoff is above her satisfactory
level.

In this section we focus on fully distributed strategic learning for satisfactory
solution in games with continuous action space. For discrete (and finite) action 
space we refer to HO]- We show that the methodology in [5D] can be extended 
to continuous action space as well as to first moment mean-field games. We 
illustrate it with a basic example. See also [21]. 

Definition 3 The action profile $(a_j)_{j \in \mathcal N}$ is a pure satisfactory
solution of the game $\mathcal G$ if all the players are satisfied:

\[ r_j(a) \ge \gamma_j^*, \quad \forall j \in \mathcal N. \]

Before seeking a pure satisfactory solution, we first need to ask whether the
problem is well-posed, i.e., whether a pure satisfactory solution exists.

We assume feasibility, that is, the targets $\gamma_j^*$, $j \in \mathcal N$, are
chosen such that the set
$\{a = (a_1, \ldots, a_n) \in \prod_j \mathcal A_j \mid r_j(a) \ge \gamma_j^*, \ \forall j\}$
is nonempty. This means that there exists a vector
$(\epsilon_1, \ldots, \epsilon_n)$, $\epsilon_j \ge 0$, such that there is an
action profile $a$ that satisfies $r_j(a) = \gamma_j^* + \epsilon_j$ for all $j$.
Thus, a necessary and sufficient condition for the existence of a satisfactory
solution is that the vector $\gamma^* + \epsilon$ belongs to the set
$r(\prod_j \mathcal A_j)$, i.e., the range of the function
$r(a) := (r_1(a), \ldots, r_n(a))$.

Consider a basic wireless network with $n$ users. The action space is
$\mathcal A_j = [0, a_{j,\max}]$, $a_{j,\max} > 0$. There is a state space
$\mathcal W$ with states $w = (w_1, w_2, \ldots, w_n)$, where
$w_j = (w_{jj'})_{j'}$, $w_{jj'} = |h_{jj'}|^2 \ge 0$, $h_{jj'} \in \mathbb C$. The
payoff of user $j$ in state $w$ is the signal-to-interference-plus-noise ratio:

\[ r_j(w, a) = \mathrm{SINR}_j(w, a) = \frac{a_j\, w_{jj}}{N_0 + \sum_{j' \ne j} \epsilon_{jj'}\, a_{j'}\, w_{jj'}}, \]

where $N_0 > 0$ is the background noise and $\epsilon_{jj'} > 0$. The goal
of each player is not necessarily to maximize the payoff; it is to reach a
certain target $\gamma_j^*$.

Definition 4 We say that user $j$ is satisfied if $r_j(w, a) \ge \gamma_j^*$.

A satisfactory solution at state $w$ is a situation $a$ where all the users are
satisfied. Such a situation may not exist in general. We examine the case where
there is at least one solution. Of course, if the full state $w$ and all the
parameters are known, one can compute a centralized solution. However, in the
distributed setting, a user may not have access to the other users' channel
gains and locations. Thus, it is important to guarantee a certain quality of
service (QoS) with minimal information for all the users. Our goal here is
to develop very fast and convergent fully distributed learning algorithms for
satisfactory solutions. The only information required by each user is the
numerical realized value of its own payoff $r_{j,t}$ and its own target
$\gamma_j^*$. The basic Banach-Picard fixed-point iteration is given by

\[ a_{j,t+1} = \mathrm{proj}_{\mathcal A_j}\left[ a_{j,t}\, \frac{\gamma_j^*}{r_{j,t}} \right], \]

where $\mathrm{proj}_{\mathcal A_j}$ denotes the projection operator onto the
convex and compact set $\mathcal A_j$, i.e.,
$\mathrm{proj}_{\mathcal A_j}(x) = \min(a_{j,\max}, \max(0, x))$.

Note that $\max(0, x) = [x]_+ = \frac{x + |x|}{2}$ and
$\min(a_{j,\max}, x) = \frac{a_{j,\max} + x - |a_{j,\max} - x|}{2}$.

The proposed algorithm is fully distributed in the sense that a user does not
need to observe the actions of the others in order to update its strategy
iteratively.



Algorithm 8 : Fully distributed Banach-Picard learning for satisfactory solution

Initialization :
    Make a starting guess $a_{j,0}$
Banach-Picard learning pattern :
    For each time slot t up to T
        For each user $j \in \mathcal N$ do:
            Observe a numerical value $r_{j,t}$
            Compute $a_{j,t+1} = \mathrm{proj}_{\mathcal A_j}\left[ a_{j,t}\, \gamma_j^* / r_{j,t} \right]$
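The Banach-Picard learning pattern can be simulated on a toy network. Everything numerical below is a hypothetical illustration: three users, unit direct gains, cross gains 0.1, noise 0.1, common target 2 and maximum action 1. The SINR model $r_j = a_j w_{jj} / (N_0 + \sum_{j' \ne j} w_{jj'} a_{j'})$ is an assumed simple form with unit interference coefficients. Each user updates from its own action, its own realized payoff and its own target only.

```python
def sinr(a, w, noise):
    """Realized SINR of every user under the assumed simple interference model."""
    n = len(a)
    out = []
    for j in range(n):
        interference = sum(w[j][i] * a[i] for i in range(n) if i != j)
        out.append(w[j][j] * a[j] / (noise + interference))
    return out


def banach_picard(a0, w, noise, targets, a_max, steps):
    """Each user j rescales its action by target_j / realized_payoff_j, then projects."""
    a = list(a0)
    for _ in range(steps):
        r = sinr(a, w, noise)  # the environment returns the realized payoffs
        a = [min(a_max, max(0.0, a[j] * targets[j] / r[j])) for j in range(len(a))]
    return a


gains = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]]
targets = [2.0, 2.0, 2.0]
actions = banach_picard([0.5, 0.5, 0.5], gains, 0.1, targets, 1.0, 100)
payoffs = sinr(actions, gains, 0.1)
print(actions, payoffs)
```

In this symmetric example the fixed point solves $a = 2(0.1 + 0.2a)$, i.e., $a = 1/3$, and every user meets its target there. The starting actions must be strictly positive so that the realized payoff is nonzero.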



Next we discuss the convergence of the Banach-Picard learning algorithm 
for a fixed state w. 

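Assumption A0: $\rho(M^w) < 1$, where
$M^w_{jj'} = \gamma_j^*\, \epsilon_{jj'}\, \frac{w_{jj'}}{w_{jj}}$ for $j' \ne j$
and $M^w_{jj} = 0$.

It is clear that under assumption A0, the system $(I - M^w)\, a = b$, where
$b_j = \frac{\gamma_j^* N_0}{w_{jj}}$, has a solution. We say that the problem is
feasible if $(I - M^w)\, a = b$ has a solution and the solution $a^*$ satisfies
$0 < a_j^* < a_{j,\max}$.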

Proposition 5 Consider the Banach-Picard learning algorithm for a fixed state
$w$ for which the problem is feasible.

• Suppose that the sequence of action profiles $\{a_t\}_t$ generated by the
Banach-Picard algorithm converges to some point $a^*$ that belongs to the
relative interior of $\prod_j \mathcal A_j$. Then, $a^*$ is a satisfactory
solution.

• Under the sufficient condition for existence A0 and the feasibility
condition, the Banach-Picard iteration converges to a satisfactory solution.

• The convergence rate of the Banach-Picard algorithm for satisfactory
solution seeking is a geometric decay, and hence the convergence time within
an $\eta$ error tolerance is

\[ T_\eta = 1 + \lfloor \max(0, T) \rfloor \]



where 



T — ^ 
In 1 
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Remark 2 (Advantages) This proposition is very important since a satisfac- 
tory solution can be seen as a global optimum of the game with payoff function 
$1_{\{r_j \ge \gamma_j^*\}}$, where $1_{\{\cdot\}}$ denotes the indicator
function. In particular, the above algorithm is a fully distributed learning
scheme that converges (under the existence and feasibility conditions) to a
global optimum (and hence a Pareto optimum), which is remarkable.

Remark 3 (Limitations) The fully distributed Banach-Picard algorithm proposed
above is convergent under some range of parameters, and it requires only
minimal information (it is fully distributed). However, the convergence
time is still unsatisfactory. We aim to investigate whether it is possible to
get a faster convergence rate. To do so, we use speedup learning techniques.

One of the first speedup techniques for satisfactory solutions is the reverse
Ishikawa learning, which consists of choosing a learning rate (bigger than one)
that converges to one.



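\[ a_{j,t+1} = \mathrm{proj}_{\mathcal A_j}\left[ \lambda_t\, a_{j,t}\, \frac{\gamma_j^*}{r_{j,t}} + (1 - \lambda_t)\, a_{j,t} \right], \qquad (16) \]

\[ 1 < \lambda_t < 2, \quad \lim_{t} \lambda_t = 1, \quad a_0 \in \mathcal A. \qquad (17) \]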



Theorem 5 Under the same assumptions as in Proposition 5 and an appropriate
choice of $\lambda_t$, the reverse Ishikawa learning converges faster than the
Banach-Picard learning.

Note that the projection is now required even if $\mathcal A_j$ is convex,
because for $\lambda_t > 1$ one gets $1 - \lambda_t < 0$, so the update is not a
convex combination. In general, it is difficult to compute in advance the value
of $\lambda_t$ that maximizes the rate of convergence.

In order to get a higher-order convergence rate, one can use a Steffensen
speedup learning of the reverse Ishikawa.

Remark 4 In the above speedup learning for satisfaction, we have limited
ourselves to the case where the state is quasi-static. However, in wireless
networks, it could be stochastic, leading to a stochastic learning algorithm.
Then, the goal is to find a satisfactory solution in expectation. For that
case, Assumption A0 is too restrictive: the spectral radius of the matrix
$M^w$ may not be less than one for some realized state $w$. Then, a feasible
solution may not exist, but the algorithm converges to $a_{j,\max}$.

In the context of large-scale games, one needs to scale the SINR. By choosing
the parameter $\epsilon$, and $a_{j,\max} = a_{\max} > 0$,
$\gamma_j^* = \gamma^*$ independent of $j$, one can express the payoff of a
generic user as a function of $a_j$, $\bar m$, and a load factor $\alpha$.
Therefore, the mean-field learning version becomes

\[ \bar m_{t+1} = \mathrm{proj}_{\mathcal A}\left[ \bar m_t\, \frac{\gamma^*}{r_t} \right] \qquad (18) \]
\[ \phantom{\bar m_{t+1}} = \mathrm{proj}_{\mathcal A}\left[ \gamma^* (N_0 + \alpha\, \bar m_t) \right], \qquad (19) \]
\[ \bar m_0 \in \mathcal A = [0, a_{\max}]. \qquad (20) \]

Thus, $\rho(M^w) = \gamma^* \alpha < 1$ and
$\bar m^* = \frac{\gamma^* N_0}{1 - \gamma^* \alpha} < a_{\max}$ is an interior
satisfactory solution, and the payoff of each generic player is

\[ \frac{\bar m^* (1 - \gamma^* \alpha)}{N_0} = \gamma^*. \]



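The reverse Ishikawa is

\[ \bar m_{t+1} = \mathrm{proj}_{\mathcal A}\left[ \lambda_t\, \bar m_t\, \frac{\gamma^*}{r_t} + (1 - \lambda_t)\, \bar m_t \right] \qquad (21) \]
\[ \phantom{\bar m_{t+1}} = \mathrm{proj}_{\mathcal A}\left[ \lambda_t \gamma^* (N_0 + \alpha\, \bar m_t) + (1 - \lambda_t)\, \bar m_t \right] \qquad (22) \]
\[ \phantom{\bar m_{t+1}} = \mathrm{proj}_{\mathcal A}\left[ \lambda_t \gamma^* N_0 + (\lambda_t \gamma^* \alpha + (1 - \lambda_t))\, \bar m_t \right], \qquad (23) \]
\[ 1 < \lambda_t < 2, \quad \bar m_0 \in \mathcal A. \qquad (24) \]

For $\lambda_t = \lambda \in (1, 2)$, one can compare the spectral radii:
$0 < \lambda \gamma^* \alpha + (1 - \lambda) < \gamma^* \alpha < 1$. Thus, the
reverse Ishikawa learning converges faster than the Banach-Picard fixed-point
iteration, although the rate remains linear for a constant $\lambda$.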

We choose the target SINR to be $\gamma^* = 20$, the scaled background noise
$N_0 = 0.3$, the load $\alpha = 1/30$ and $a_{\max} = 20$. Then, the problem is
feasible and the satisfactory solution is 18. In Table 3 we illustrate the
convergence to the satisfactory solution: Banach-Picard and its speedup
versions with reverse Ishikawa and Steffensen. We initialize the mean of the
mean-field at $\bar m_0 = 2$ and observe that 50 iterations of the
Banach-Picard learning correspond approximately to 25 iterations of the
reverse Ishikawa learning and only 5 iterations of the Steffensen speedup
algorithm. Figure 4 summarizes the three mean-field learning trajectories.

Figure 4: Mean-field learning for satisfactory solution: Banach-Picard, reverse
Ishikawa and Steffensen (horizontal axis: number of iterations).

 t   Banach-Picard m_t      Reverse Ishikawa       Steffensen
                            (lambda = 5/3)
 1    2.000000000000000      2                     12
 2    7.333333333333333     10.888888888888888     14
 3   10.888888888888888     14.839506172839506     12.000000000000002
 4   13.259259259259260     16.595336076817560      9.000000000000002
 5   14.839506172839506     17.375704923030028     17.999999999999986
 6   15.893004115226336     17.722535521346678     17.999999999999979
 7   16.595336076817556     17.876682453931856
 8   17.063557384545035     17.945192201747496
 9   17.375704923030021     17.975640978554445
10   17.583803282020014     17.989173768246424
11   17.722535521346675     17.995188341442855
12   17.815023680897784     17.997861485085711
13   17.876682453931856     17.999049548926983
14   17.917788302621240     17.999577577300883
15   17.945192201747492     17.999812256578174
16   17.963461467831664     17.999916558479192
17   17.975640978554441     17.999962914879639
18   17.983760652369625     17.999983517724282
19   17.989173768246417     17.999992674544124
20   17.992782512164279     17.99999674424183
21   17.995188341442852     17.999998552996367
22   17.996792227628568     17.999999356887276
23   17.997861485085714     17.999999714172120
24   17.998574323390475     17.999999872965383
25   17.999049548926983     17.999999943540168
26   17.999366365951325     17.999999974906743
27   17.999577577300883
28   17.999718384867258
29   17.999812256578171
30   17.999874837718778
31   17.999916558479185
32   17.999944372319455
33   17.999962914879639
34   17.999975276586426
35   17.999983517724285
36   17.999989011816190
37   17.999992674544124
38   17.999995116362747
39   17.999996744241834
40   17.999997829494557
41   17.999998552996370
42   17.999999035330912
43   17.999999356887276
44   17.999999571258183
45   17.999999714172120
46   17.999999809448077
47   17.999999872965383
48   17.999999915310255
49   17.999999943540171
50   17.999999962360114

Table 3: Acceleration of mean-field learning for Satisfactory solution.

The satisfactory solution estimated by Banach-Picard mean-field learning is
$m^* = 17.999999962360114$, and the error estimate for $m^*$ in the
Banach-Picard learning is $1.2547 \times 10^{-8}$ after 50 iterations. The
first speedup technique, based on reversed Ishikawa with $\lambda = 5/3 > 1$,
reaches 17.999999974906743 after 26 iterations, which is clearly a superlinear
convergence rate. The error estimate of the reverse Ishikawa speedup is
$1.3941 \times 10^{-8}$ after 26 iterations, starting from $m_0 = 2$, which is
far away from the satisfactory solution. We can get a higher-order convergence
rate: the speedup technique à la Steffensen provides an error of
$7 \times 10^{-15}$ after only 6 iterations. The numerical gap
$d(m_t, f(m_t))$ is of order $7.105427357601002 \times 10^{-15}$, which is an
acceptable error tolerance for 6 iterations.
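
The three schemes compared above can be sketched in a few lines. This is an illustrative sketch only. It assumes the affine map f(m) = (2/3)m + 6, which is inferred from the Banach-Picard column of Table 3 (each step shrinks the distance to 18 by a factor 2/3) and is not stated in the text; the function names are hypothetical. For an affine map a single Steffensen (Aitken) step lands on the fixed point up to rounding, so the sketch does not reproduce the Steffensen column of the table.

```python
def f(m):
    # Assumed map, inferred from Table 3: fixed point 18, slope 2/3.
    return 2.0 * m / 3.0 + 6.0

def banach_picard(m, steps):
    # Plain fixed-point iteration m <- f(m).
    for _ in range(steps):
        m = f(m)
    return m

def reverse_ishikawa(m, steps, lam=5.0 / 3.0):
    # Over-relaxed step with lam > 1: m <- (1 - lam) * m + lam * f(m).
    for _ in range(steps):
        m = (1.0 - lam) * m + lam * f(m)
    return m

def steffensen(m, steps):
    # Aitken delta-squared acceleration applied to the fixed-point map f.
    for _ in range(steps):
        fm = f(m)
        ffm = f(fm)
        denom = ffm - 2.0 * fm + m
        if denom == 0.0:  # already (numerically) at a fixed point
            return m
        m = m - (fm - m) ** 2 / denom
    return m

bp = banach_picard(2.0, 49)     # row t = 50 of the table (row t = 1 is m = 2)
ri = reverse_ishikawa(2.0, 25)  # row t = 26
st = steffensen(2.0, 1)
print(bp, ri, st)
```

Under this assumed map, the residual abs(f(bp) - bp) is one third of the distance to 18, which is how the reported Banach-Picard error estimate relates to the last table entry.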
6 Concluding remarks



In this paper we have studied mean-field learning in large-scale systems. Our
results show that, in large-scale aggregative games with an additive
aggregation term, mean-field learning reduces not only the complexity of the
problem (instead of a system of iterative equations, one equation per class or
type suffices) but also the information requirement. We have examined both
convergence time and speed of convergence, and proposed acceleration techniques
for partially distributed mean-field learning with convergence time of order
$O(\log(\log(1/\eta)))$.

In the fully distributed mean-field learning, we have assumed that each player
is able to observe or measure a noisy numerical value of her payoff in a linear
way. What happens if this assumption does not hold and a correlated non-linear
signal is observed as output instead? After an experimentation at time $t$,
player $j$ observes the realized process
$o_{j,t} = f(r_{j,t}, \epsilon_{j,t})$, where $f$ is the observation function
and $\epsilon_{j,t}$ is the measurement noise. If the observation function $f$
is binary, then one gets a 0-1 output or a noisy ACK/NACK feedback. If $f$ is
known and invertible with respect to the first component, one can use a
non-linear mean-field estimator to track the "true" payoff function
simultaneously with the strategy. As a third alternative, we have seen that
no-feedback mean-field learning is possible: each player estimates the initial
mean and iterates offline without any observation or signal from the system.
Several questions remain open:

(i) In mean-field learning without feedback, how should each player estimate
the starting point, and what is the impact of the inconsistency of the process
with respect to the mean field?

(ii) What is the outcome of the mean-field game if some fraction of the players
use partially distributed learning, some fraction use fully distributed
learning schemes, and the others learn without any feedback?

(iii) How can the mean-field learning framework be extended to payoff functions
that depend not only on the mean but also on higher moments or on the entire
mean-field distribution?

(iv) Our analysis of speedup learning algorithms is limited to deterministic
functions. It would be interesting to investigate the stochastic version of ([H).

We do not have answers to these questions and postpone them for future 
investigation. 
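Proof of Theorem 4. Let $t > 2$. We iterate the recursive inequality:

\begin{align}
\eta_t &\le c_2\, \eta_{t-1}^{\alpha+1} \tag{25}\\
&\le c_2 \left(c_2\, \eta_{t-2}^{\alpha+1}\right)^{\alpha+1} = c_2^{1+(\alpha+1)}\, \eta_{t-2}^{(\alpha+1)^2} \tag{26}\\
&\le c_2^{1+(\alpha+1)} \left(c_2\, \eta_{t-3}^{\alpha+1}\right)^{(\alpha+1)^2} \tag{27}\\
&= c_2^{1+(\alpha+1)+(\alpha+1)^2}\, \eta_{t-3}^{(\alpha+1)^3} \tag{28}\\
&\le c_2^{1+(\alpha+1)+(\alpha+1)^2+\cdots+(\alpha+1)^{t-1}}\, \eta_0^{(\alpha+1)^t}. \tag{29}
\end{align}

Recall that for $q \neq 1$, $1 + q + \cdots + q^{t-1} = \frac{q^t - 1}{q - 1}$.
Thus, with $q = \alpha + 1$,

\[ \eta_t \le c_2^{\frac{(\alpha+1)^t - 1}{\alpha}}\, \eta_0^{(\alpha+1)^t}. \]

This means that the convergence time to be within an $\eta^*$-neighborhood of
the mean-field equilibrium is at most the first $t$ satisfying
$c_2^{\frac{(\alpha+1)^t - 1}{\alpha}}\, \eta_0^{(\alpha+1)^t} \le \eta^*$. Thus,

\[ \left(c_2^{1/\alpha}\, \eta_0\right)^{(\alpha+1)^t} \le c_2^{1/\alpha}\, \eta^*. \]

By taking the logarithm twice (with $c_2^{1/\alpha}\eta^* \le c_2^{1/\alpha}\eta_0 < 1$,
so that both logarithms below are negative), one gets

\[ t \ge \frac{1}{\ln(\alpha+1)} \ln\left[ \frac{\ln\left(c_2^{1/\alpha}\, \eta^*\right)}{\ln\left(c_2^{1/\alpha}\, \eta_0\right)} \right], \]

which is the announced result.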
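
To make the "logarithm taken twice" concrete, here is a minimal sketch that evaluates the doubly logarithmic time and compares it with a direct iteration of the worst-case recursion. The names c2, alpha, eta0 and eta_star are one reading of the constants in the proof, and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def double_log_time(c2, alpha, eta0, eta_star):
    # Time t at which c2**(((alpha+1)**t - 1)/alpha) * eta0**((alpha+1)**t)
    # first drops to eta_star (real-valued, before rounding up).
    k = c2 ** (1.0 / alpha)
    if not (0.0 < k * eta_star < k * eta0 < 1.0):
        raise ValueError("need 0 < eta_star < eta0 and c2**(1/alpha) * eta0 < 1")
    ratio = math.log(k * eta_star) / math.log(k * eta0)
    return math.log(ratio) / math.log(alpha + 1.0)

def steps_to_reach(c2, alpha, eta0, eta_star):
    # Worst case of the recursion: eta_t = c2 * eta_{t-1}**(alpha + 1).
    eta, t = eta0, 0
    while eta > eta_star:
        eta = c2 * eta ** (alpha + 1.0)
        t += 1
    return t

bound = double_log_time(1.0, 1.0, 0.5, 1e-8)
steps = steps_to_reach(1.0, 1.0, 0.5, 1e-8)
print(bound, steps)
```

In this quadratic example (alpha = 1), tightening the tolerance from 1e-8 to 1e-16 adds only one more iteration, which is the practical meaning of a double-logarithmic convergence time.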
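Proof of Proposition 2.

Using the strict contraction map $f$, we estimate the error at time $t$ and get

\[ d(m_t, m^*) \le \frac{\alpha_1^t}{1-\alpha_1}\, d(m_0, m_1) \le \eta, \]

i.e.

\[ \alpha_1^t \le \frac{\eta(1-\alpha_1)}{d(m_0, m_1)}. \]

Taking the logarithm yields

\[ t \ge \frac{1}{\ln\frac{1}{\alpha_1}} \ln\left[ \frac{d(m_0, m_1)}{\eta(1-\alpha_1)} \right]. \]

This completes the proof.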
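
As a quick numerical check of a contraction-based convergence time, the sketch below uses the standard a priori estimate d(m_t, m*) <= alpha1**t / (1 - alpha1) * d(m_0, m_1) for a strict contraction with constant alpha1. The example map, the tolerance and the helper name contraction_time are illustrative assumptions.

```python
import math

def contraction_time(alpha1, d01, eta):
    # Smallest integer t with alpha1**t / (1 - alpha1) * d01 <= eta.
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie in (0, 1)")
    t = math.log(d01 / (eta * (1.0 - alpha1))) / math.log(1.0 / alpha1)
    return max(0, math.ceil(t))

def f(m):
    # Assumed example: a contraction with constant 2/3 and fixed point 18.
    return 2.0 * m / 3.0 + 6.0

m0 = 2.0
T = contraction_time(2.0 / 3.0, abs(f(m0) - m0), 1e-6)

m = m0
for _ in range(T):
    m = f(m)
print(T, abs(m - 18.0))
```

For an affine map the a priori estimate is tight, so here the predicted number of iterations is exactly the number needed.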
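Proof of Proposition 3.

To prove the convergence time of learning with a non-expansive map, we use
Theorem 2. The error bound to an approximated fixed point is

\[ d(m_t, f(m_t)) \le \frac{2\, d(m_0, \mathrm{fix}(f))}{\sqrt{\pi \sum_{s=1}^{t} \lambda_s (1-\lambda_s)}}. \]

Remark that $\lambda(1-\lambda) \le \frac{1}{4}$ for $\lambda \in [0,1]$, with
equality at $\lambda = 1/2$. With this choice, the convergence time is at most
a time $t$ that satisfies

\[ \frac{2\, d(m_0, \mathrm{fix}(f))}{\sqrt{\pi t / 4}} \le \eta, \qquad \text{i.e.} \qquad \frac{16\, d(m_0, \mathrm{fix}(f))^2}{\pi \eta^2} \le t. \]

Hence,

\[ T = \frac{16\, d(m_0, \mathrm{fix}(f))^2}{\pi \eta^2}. \]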
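
The following sketch illustrates the non-expansive case on a toy example. It assumes that the residual bound reads 4 d(m_0, fix(f)) / sqrt(pi t) for the constant step 1/2, so that the convergence time is 16 d(m_0, fix(f))^2 / (pi eta^2). The map (a rotation of the plane by 90 degrees, non-expansive but not a contraction) and all names are illustrative assumptions.

```python
import math

def f(m):
    # Rotation by 90 degrees: non-expansive, unique fixed point (0, 0).
    x, y = m
    return (-y, x)

def averaged_step(m, lam=0.5):
    # Averaged (Mann-type) step m <- (1 - lam) * m + lam * f(m).
    fx, fy = f(m)
    return ((1 - lam) * m[0] + lam * fx, (1 - lam) * m[1] + lam * fy)

def residual(m):
    # Distance between m and f(m).
    fx, fy = f(m)
    return math.hypot(m[0] - fx, m[1] - fy)

m0 = (1.0, 0.0)
d0 = math.hypot(*m0)          # distance from m0 to the fixed point
eta = 0.1
T = math.ceil(16 * d0 ** 2 / (math.pi * eta ** 2))

m, within_bound = m0, True
for t in range(1, T + 1):
    m = averaged_step(m)
    within_bound = within_bound and residual(m) <= 4 * d0 / math.sqrt(math.pi * t)
print(T, residual(m), within_bound)
```

On this example the plain Banach-Picard iteration cycles with period 4 and never converges, while the averaged iteration does; the bound is a worst-case guarantee and is far from tight here.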



Proof of Proposition 4. The proof follows immediately from the geometric
decay inequality in Theorem 3, following similar lines as in Proposition 5.

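Proof of Proposition 1. Let $z_t = a_{\int_0^t \lambda_s\, ds}$. The function
$z$ is differentiable and its time derivative is

\[ \dot{z}_t = \left( \frac{d}{dt} \int_0^t \lambda_s\, ds \right) \dot{a}_{\int_0^t \lambda_s\, ds} = \lambda_t\, \dot{a}_{\int_0^t \lambda_s\, ds}. \]

Moreover, $z_0 = a_0$. By Cauchy's theorem, $z_t = b_t$. Suppose now that $a$
reaches a target set $O$ within at most $T_a$ time units. Then the trajectory
of $b$ reaches the same set $O$ within at most $T_b = g^{-1}(T_a)$, where
$g : t \mapsto \int_0^t \lambda_s\, ds$. Since $\lambda > 0$ and $\lambda$ is
non-integrable, the equation $g(t) = T_a$ has a solution.

If $\lambda_s = \lambda$, then $T_b = \frac{T_a}{\lambda}$.

If $\lambda_s = e^s$, then $T_b = \ln(T_a + 1)$, i.e., $\frac{T_b}{T_a}$ goes to
zero as $T_a$ grows, and $b$ is faster than $a$.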
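
The time-change argument can be checked numerically on a toy dynamics. The sketch below assumes da/dt = -a as the reference dynamics (not from the paper) and lambda_s = e^s; it integrates the accelerated dynamics db/dt = lambda_t * (-b) with a classical Runge-Kutta scheme and compares the result with a evaluated at the rescaled time g(t). All function names are hypothetical.

```python
import math

def g(t):
    # g(t) = integral of lambda_s = exp(s) over [0, t].
    return math.exp(t) - 1.0

def a(t, a0=1.0):
    # Reference trajectory of the assumed toy dynamics da/dt = -a.
    return a0 * math.exp(-t)

def b_rk4(t_end, a0=1.0, n=1000):
    # Integrate db/dt = -exp(t) * b with the classical fourth-order RK scheme.
    def rhs(s, x):
        return -math.exp(s) * x
    h, t, b = t_end / n, 0.0, a0
    for _ in range(n):
        k1 = rhs(t, b)
        k2 = rhs(t + h / 2, b + h * k1 / 2)
        k3 = rhs(t + h / 2, b + h * k2 / 2)
        k4 = rhs(t + h, b + h * k3)
        b += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return b

T_a = math.log(1e3)         # time for a to fall from 1 to 1e-3
T_b = math.log(T_a + 1.0)   # g^{-1}(T_a) for lambda_s = exp(s)
print(b_rk4(1.0), a(g(1.0)), T_a, T_b)
```

The integrated b at time 1 agrees with a at time g(1), and the hitting time of the accelerated trajectory is the logarithm of the original one shifted by one.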
Proof of Proposition 5.

If the algorithm converges to some point $a^*$ then, combining the continuity
of the projection map and the continuity of the payoff function, one gets the
continuity of the right-hand side of

o-j = Proj^^ 

which is continuous in $a_t$. Taking the limit as $t$ goes to infinity, one
gets $a_j^* = a_{j,\max}$ or $a^* = (I - M^{\gamma})^{-1} b$, and the payoff of
player $j$ is $r_j^* = \gamma_j^*$, which means that every player $j$ is
satisfied in an interior steady state.

Now assume feasibility and assumption A0. Then, from the Perron-Frobenius
theorem, we know that $(I - M^{\gamma})^{-1}$ exists and that
$(I - M^{\gamma})^{-1} b$ is positive componentwise. Under the feasibility
condition, the algorithm generates an error as

\[ d(a_t, a^*) \le \rho(M^{\gamma})^t\, d(a_0, a^*), \]

which provides the convergence of the algorithm to $a^*$. We use a similar
analysis as in Proposition 4 to deduce the convergence time

\[ T = \frac{1}{\ln\frac{1}{\rho(M^{\gamma})}} \ln\left[ \frac{d(a_0, a^*)}{\eta} \right]. \]

Proof of Theorem 5.

Following the proof of Proposition 5, one gets that there is some time $T > 1$
such that for all $t > T$ the spectral radius of the time-varying matrix
$\lambda_t M^{\gamma} + (1 - \lambda_t) I$, for $\lambda_t \in (1,2)$, is less
than $\rho(M^{\gamma})$, i.e., the reversed Ishikawa has a superlinear
convergence rate. If $\rho(M^{\gamma}) < 1$ then both algorithms converge to
the same point, and the reverse Ishikawa learning algorithm converges faster
than the Banach-Picard fixed-point iteration. The reverse Ishikawa is a speedup
version of the Banach-Picard algorithm, and the announced result follows.
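
The spectral-radius comparison in this proof can be illustrated without any linear-algebra library, because the eigenvalues of the relaxed matrix lam * M + (1 - lam) * I are lam * mu + (1 - lam) for each eigenvalue mu of M. The spectrum below is a hypothetical example with real positive eigenvalues, chosen so that the largest one matches the factor 2/3 of the numerical example; it is not taken from the paper.

```python
def relaxed_radius(eigenvalues, lam):
    # Spectral radius of lam * M + (1 - lam) * I, given the eigenvalues of M.
    return max(abs(lam * mu + (1.0 - lam)) for mu in eigenvalues)

eigs = [2.0 / 3.0, 0.5]                      # assumed spectrum of M
rho_picard = max(abs(mu) for mu in eigs)     # contraction factor of Banach-Picard
rho_reverse = relaxed_radius(eigs, 5.0 / 3.0)
print(rho_picard, rho_reverse)
```

With lam = 5/3 the factor drops from 2/3 to 4/9 = (2/3)^2 per iteration, so one reverse Ishikawa step does the work of two Banach-Picard steps on this spectrum. That matches the observation that 50 Banach-Picard iterations correspond to about 25 reverse Ishikawa iterations.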